

Intelligent Data Analysis: From Data Gathering to Data Comprehension Deepak Gupta


Intelligent Data Analysis

From Data Gathering to Data Comprehension

Edited by

Deepak Gupta
Maharaja Agrasen Institute of Technology, Delhi, India

Siddhartha Bhattacharyya
CHRIST (Deemed to be University), Bengaluru, India

Ashish Khanna
Maharaja Agrasen Institute of Technology, Delhi, India

Kalpna Sagar
KIET Group of Institutions, Uttar Pradesh, India

This edition first published 2020
© 2020 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar to be identified as the authors of the editorial material in this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This work’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Gupta, Deepak, editor.

Title: Intelligent data analysis: from data gathering to data comprehension / edited by Dr. Deepak Gupta, Dr. Siddhartha Bhattacharyya, Dr. Ashish Khanna, Ms. Kalpna Sagar.

Description: Hoboken, NJ, USA: Wiley, 2020. | Series: The Wiley series in intelligent signal and data processing | Includes bibliographical references and index.

Identifiers: LCCN 2019056735 (print) | LCCN 2019056736 (ebook) | ISBN 9781119544456 (hardback) | ISBN 9781119544449 (adobe pdf) | ISBN 9781119544463 (epub)

Subjects: LCSH: Data mining. | Computational intelligence.

Classification: LCC QA76.9.D343 I57435 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12–dc23

LC record available at https://lccn.loc.gov/2019056735

LC ebook record available at https://lccn.loc.gov/2019056736

Cover Design: Wiley

Cover Image: © gremlin/Getty Images

Set in 9.5/12.5pt STIXTwoText by SPi Global, Chennai, India

Deepak Gupta would like to dedicate this book to his father, Sh. R.K. Gupta, his mother, Smt. Geeta Gupta, his mentors for their constant encouragement, and his family members, including his wife, brothers, sisters, kids and the students.

Siddhartha Bhattacharyya would like to dedicate this book to his parents, the late Ajit Kumar Bhattacharyya and the late Hashi Bhattacharyya, his beloved wife, Rashni, and his research scholars, Sourav, Sandip, Hrishikesh, Pankaj, Debanjan, Alokananda, Koyel, and Tulika.

Ashish Khanna would like to dedicate this book to his parents, the late R.C. Khanna and Smt. Surekha Khanna, for their constant encouragement and support, and to his wife, Sheenu, and children, Master Bhavya and Master Sanyukt.

Kalpna Sagar would like to dedicate this book to her father, Mr. Lekh Ram Sagar, and her mother, Smt. Gomti Sagar, the strongest persons of her life.

Contents

List of Contributors

Series Preface

Preface

1 Intelligent Data Analysis: Black Box Versus White Box Modeling

Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma

1.1 Introduction

1.1.1 Intelligent Data Analysis

1.1.2 Applications of IDA and Machine Learning

1.1.3 White Box Models Versus Black Box Models

1.1.4 Model Interpretability

1.2 Interpretation of White Box Models

1.2.1 Linear Regression

1.2.2 Decision Tree

1.3 Interpretation of Black Box Models

1.3.1 Partial Dependence Plot

1.3.2 Individual Conditional Expectation

1.3.3 Accumulated Local Effects

1.3.4 Global Surrogate Models

1.3.5 Local Interpretable Model-Agnostic Explanations

1.3.6 Feature Importance

1.4 Issues and Further Challenges

1.5 Summary

References

2 Data: Its Nature and Modern Data Analytical Tools

Ravinder Ahuja, Shikhar Asthana, Ayush Ahuja, and Manu Agarwal

2.1 Introduction

2.2 Data Types and Various File Formats

2.2.1 Structured Data

2.2.2 Semi-Structured Data

2.2.3 Unstructured Data

2.2.4 Need for File Formats

2.2.5 Various Types of File Formats

2.2.5.1 Comma Separated Values (CSV)

2.2.5.2 ZIP

2.2.5.3 Plain Text (txt)

2.2.5.4 JSON

2.2.5.5 XML

2.2.5.6 Image Files

2.2.5.7 HTML

2.3 Overview of Big Data

2.3.1 Sources of Big Data

2.3.1.1 Media

2.3.1.2 The Web

2.3.1.3 Cloud

2.3.1.4 Internet of Things

2.3.1.5 Databases

2.3.1.6 Archives

2.3.2 Big Data Analytics

2.3.2.1 Descriptive Analytics

2.3.2.2 Predictive Analytics

2.3.2.3 Prescriptive Analytics

2.4 Data Analytics Phases

2.5 Data Analytical Tools

2.5.1 Microsoft Excel

2.5.2 Apache Spark

2.5.3 OpenRefine

2.5.4 R Programming

2.5.4.1 Advantages of R

2.5.4.2 Disadvantages of R

2.5.5 Tableau

2.5.5.1 How Tableau Works

2.5.5.2 Tableau Feature

2.5.5.3 Advantages

2.5.5.4 Disadvantages

2.5.6 Hadoop

2.5.6.1 Basic Components of Hadoop

2.5.6.2 Benefits

2.6 Database Management System for Big Data Analytics

2.6.1 Hadoop Distributed File System

2.6.2 NoSql

2.6.2.1 Categories of NoSql

2.7 Challenges in Big Data Analytics

2.7.1 Storage of Data

2.7.2 Synchronization of Data

2.7.3 Security of Data

2.7.4 Fewer Professionals

2.8 Conclusion

References

3 Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts

Shubham Kumaram, Samarth Chugh, and Deepak Kumar Sharma

3.1 Introduction

3.2 Probability

3.2.1 Definitions

3.2.1.1 Random Experiments

3.2.1.2 Probability

3.2.1.3 Probability Axioms

3.2.1.4 Conditional Probability

3.2.1.5 Independence

3.2.1.6 Random Variable

3.2.1.7 Probability Distribution

3.2.1.8 Expectation

3.2.1.9 Variance and Standard Deviation

3.2.2 Bayes’ Rule

3.3 Descriptive Statistics

3.3.1 Picture Representation

3.3.1.1 Frequency Distribution

3.3.1.2 Simple Frequency Distribution

3.3.1.3 Grouped Frequency Distribution

3.3.1.4 Stem and Leaf Display

3.3.1.5 Histogram and Bar Chart

3.3.2 Measures of Central Tendency

3.3.2.1 Mean

3.3.2.2 Median

3.3.2.3 Mode

3.3.3 Measures of Variability

3.3.3.1 Range

3.3.3.2 Box Plot

3.3.3.3 Variance and Standard Deviation

3.3.4 Skewness and Kurtosis

3.4 Inferential Statistics

3.4.1 Frequentist Inference

3.4.1.1 Point Estimation

3.4.1.2 Interval Estimation

3.4.2 Hypothesis Testing

3.4.3 Statistical Significance

3.5 Statistical Methods

3.5.1 Regression

3.5.1.1 Linear Model

3.5.1.2 Nonlinear Models

3.5.1.3 Generalized Linear Models

3.5.1.4 Analysis of Variance

3.5.1.5 Multivariate Analysis of Variance

3.5.1.6 Log-Linear Models

3.5.1.7 Logistic Regression

3.5.1.8 Random Effects Model

3.5.1.9 Overdispersion

3.5.1.10 Hierarchical Models

3.5.2 Analysis of Survival Data

3.5.3 Principal Component Analysis

3.6 Errors

3.6.1 Error in Regression

3.6.2 Error in Classification

3.7 Conclusion

References

4 Intelligent Data Analysis with Data Mining: Theory and Applications

Shivam Bachhety, Ramneek Singhal, and Rachna Jain

Objective

4.1 Introduction to Data Mining

4.1.1 Importance of Intelligent Data Analytics in Business

4.1.2 Importance of Intelligent Data Analytics in Health Care

4.2 Data and Knowledge

4.3 Discovering Knowledge in Data Mining

4.3.1 Process Mining

4.3.2 Process of Knowledge Discovery

4.4 Data Analysis and Data Mining

4.5 Data Mining: Issues

4.6 Data Mining: Systems and Query Language

4.6.1 Data Mining Systems

4.6.2 Data Mining Query Language

4.7 Data Mining Methods

4.7.1 Classification

4.7.2 Cluster Analysis

4.7.3 Association

4.7.4 Decision Tree Induction

4.8 Data Exploration

4.9 Data Visualization

4.10 Probability Concepts for Intelligent Data Analysis (IDA)

Reference

5 Intelligent Data Analysis: Deep Learning and Visualization

Than D. Le and Huy V. Pham

5.1 Introduction

5.2 Deep Learning and Visualization

5.2.1 Linear and Logistic Regression and Visualization

5.2.2 CNN Architecture

5.2.2.1 Vanishing Gradient Problem

5.2.2.2 Convolutional Neural Networks (CNNs)

5.2.3 Reinforcement Learning

5.2.4 Inception and ResNet Networks

5.2.5 Softmax

5.3 Data Processing and Visualization

5.3.1 Regularization for Deep Learning and Visualization

5.3.1.1 Regularization for Linear Regression

5.4 Experiments and Results

5.4.1 Mask RCNN Based on Object Detection and Segmentation

5.4.2 Deep Matrix Factorization

5.4.2.1 Network Visualization

5.4.3 Deep Learning and Reinforcement Learning

5.5 Conclusion

References

6 A Systematic Review on the Evolution of Dental Caries Detection Methods and Its Significance in Data Analysis Perspective

Soma Datta, Nabendu Chaki, and Biswajit Modak

6.1 Introduction

6.1.1 Analysis of Dental Caries

6.2 Different Caries Lesion Detection Methods and Data Characterization

6.2.1 Point Detection Method

6.2.2 Visible Light Property Method

6.2.3 Radiographs

6.2.4 Light-Emitting Devices

6.2.5 Optical Coherent Tomography (OCT)

6.2.6 Software Tools

6.3 Technical Challenges with the Existing Methods

6.3.1 Challenges in Data Analysis Perspective

6.4 Result Analysis

6.5 Conclusion

Acknowledgment

References

7 Intelligent Data Analysis Using Hadoop Cluster–Inspired MapReduce Framework and Association Rule Mining on Educational Domain

Pratiyush Guleria and Manu Sood

7.1 Introduction

7.1.1 Research Areas of IDA

7.1.2 The Need for IDA in Education

7.2 Learning Analytics in Education

7.2.1 Role of Web-Enabled and Mobile Computing in Education

7.2.2 Benefits of Learning Analytics

7.2.3 Future Research Directions of IDA

7.3 Motivation

7.4 Literature Review

7.4.1 Association Rule Mining and Big Data

7.5 Intelligent Data Analytical Tools

7.6 Intelligent Data Analytics Using MapReduce Framework in an Educational Domain

7.6.1 Data Description

7.6.2 Objective

7.6.3 Proposed Methodology

7.6.3.1 Stage 1 MapReduce Algorithm

7.6.3.2 Stage 2 Apriori Algorithm

7.7 Results

7.8 Conclusion and Future Scope

References

8 Influence of Green Space on Global Air Quality Monitoring: Data Analysis Using K-Means Clustering Algorithm

Gihan S. Pathirana and Malka N. Halgamuge

8.1 Introduction

8.2 Material and Methods

8.2.1 Data Collection

8.2.2 Data Inclusion Criteria

8.2.3 Data Preprocessing

8.2.4 Data Analysis

8.3 Results

8.4 Quantitative Analysis

8.4.1 K-Means Clustering

8.4.2 Level of Difference of Green Area

8.5 Discussion

8.6 Conclusion

References

9 IDA with Space Technology and Geographic Information System

Bright Keswani, Tarini Ch. Mishra, Ambarish G. Mohapatra, Poonam Keswani, Priyatosh Sahu, and Anish Kumar Sarangi

9.1 Introduction

9.1.1 Real-Time in Space

9.1.2 Generating Programming Triggers

9.1.3 Analytical Architecture

9.1.4 Remote Sensing Big Data Acquisition Unit (RSDU)

9.1.5 Data Processing Unit

9.1.6 Data Analysis and Decision Unit

9.1.7 Analysis

9.1.8 Incorporating Machine Learning and Artificial Intelligence

9.1.8.1 Methodologies Applicable

9.1.8.2 Support Vector Machines (SVM) and Cross-Validation

9.1.8.3 Massively Parallel Computing and I/O

9.1.8.4 Data Architecture and Governance

9.1.9 Real-Time Spacecraft Detection

9.1.9.1 Active Phased Array

9.1.9.2 Relay Communication

9.1.9.3 Low-Latency Random Access

9.1.9.4 Channel Modeling and Prediction

9.2 Geospatial Techniques

9.2.1 The Big-GIS

9.2.2 Technologies Applied

9.2.2.1 Internet of Things and Sensor Web

9.2.2.2 Cloud Computing

9.2.2.3 Stream Processing

9.2.2.4 Big Data Analytics

9.2.2.5 Coordinated Observation

9.2.2.6 Big Geospatial Data Management

9.2.2.7 Parallel Geocomputation Framework

9.2.3 Data Collection Using GIS

9.2.3.1 NoSQL Databases

9.2.3.2 Parallel Processing

9.2.3.3 Knowledge Discovery and Intelligent Service

9.2.3.4 Data Analysis

9.3 Comparative Analysis

9.4 Conclusion

References

10 Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT

Rakesh Roshan and Om Prakash Rishi

10.1 Introduction to Intelligent Transportation System (ITS)

10.1.1 Working of Intelligent Transportation System

10.1.2 Services of Intelligent Transportation System

10.1.3 Advantages of Intelligent Transportation System

10.2 Issues and Challenges of Intelligent Transportation System (ITS)

10.2.1 Communication Technology Used Currently in ITS

10.2.2 Challenges in the Implementation of ITS

10.2.3 Opportunity for Popularity of Automated/Autonomous/Self-Driving Car or Vehicle

10.3 Intelligent Data Analysis Makes an IoT-Based Transportation System Intelligent

10.3.1 Introduction to Intelligent Data Analysis

10.3.2 How IDA Makes IoT-Based Transportation Systems Intelligent

10.3.2.1 Traffic Management Through IoT and Intelligent Data Analysis

10.3.2.2 Tracking of Multiple Vehicles

10.4 Intelligent Data Analysis for Security in Intelligent Transportation System

10.5 Tools to Support IDA in an Intelligent Transportation System

References

11 Applying Big Data Analytics on Motor Vehicle Collision Predictions in New York City

Dhanushka Abeyratne and Malka N. Halgamuge

11.1 Introduction

11.1.1 Overview of Big Data Analytics on Motor Vehicle Collision Predictions

11.2 Materials and Methods

11.2.1 Collection of Raw Data

11.2.2 Data Inclusion Criteria

11.2.3 Data Preprocessing

11.2.4 Data Analysis

11.3 Classification Algorithms and K-Fold Validation Using Data Set Obtained from NYPD (2012–2017)

11.3.1 Classification Algorithms

11.3.1.1 k-Fold Cross-Validation

11.3.2 Statistical Analysis

11.4 Results

11.4.1 Measured Processing Time and Accuracy of Each Classifier

11.4.2 Measured p-Value in each Vehicle Group Using K-Means Clustering/One-Way ANOVA

11.4.3 Identified High Collision Concentration Locations of Each Vehicle Group

11.4.4 Measured Different Criteria for Further Analysis of NYPD Data Set (2012–2017)

11.5 Discussion

11.6 Conclusion

References

12 A Smart and Promising Neurological Disorder Diagnostic System: An Amalgamation of Big Data, IoT, and Emerging Computing Techniques

Prableen Kaur and Manik Sharma

12.1 Introduction

12.1.1 Difference Between Neurological and Psychological Disorders

12.2 Statistics of Neurological Disorders

12.3 Emerging Computing Techniques

12.3.1 Internet of Things

12.3.2 Big Data

12.3.3 Soft Computing Techniques

12.4 Related Works and Publication Trends of Articles

12.5 The Need for Neurological Disorders Diagnostic System

12.5.1 Design of Smart and Intelligent Neurological Disorders Diagnostic System

12.6 Conclusion

References

13 Comments-Based Analysis of a Bug Report Collection System and Its Applications

Arvinder Kaur and Shubhra Goyal

13.1 Introduction

13.2 Background

13.2.1 Issue Tracking System

13.2.2 Bug Report Statistics

13.3 Related Work

13.3.1 Data Extraction Process

13.3.2 Applications of Bug Report Comments

13.3.2.1 Bug Summarization

13.3.2.2 Emotion Mining

13.4 Data Collection Process

13.4.1 Steps of Data Extraction

13.4.2 Block Diagram for Data Extraction

13.4.3 Reports Generated

13.4.3.1 Bug Attribute Report

13.4.3.2 Long Description Report

13.4.3.3 Bug Comments Reports

13.4.3.4 Error Report

13.5 Analysis of Bug Reports

13.5.1 Research Question 1: Is the Performance of Software Affected by Open Bugs that are Critical in Nature?

13.5.2 Research Question 2: How Can Test Leads Improve the Performance of Software Systems?

13.5.3 Research Question 3: Which Are the Most Error-Prone Areas that Can Cause System Failure?

13.5.4 Research Question 4: Which Are the Most Frequent Words and Keywords to Predict Most Critical Bugs?

13.5.5 Research Question 5: What Is the Importance of Frequent Words Mined from Bug Reports?

13.6 Threats to Validity

13.7 Conclusion

References

14 Sarcasm Detection Algorithms Based on Sentiment Strength

Pragya Katyayan and Nisheeth Joshi

14.1 Introduction

14.2 Literature Survey

14.3 Experiment

14.3.1 Data Collection

14.3.2 Finding SentiStrengths

14.3.3 Proposed Algorithm

14.3.4 Explanation of the Algorithms

14.3.5 Classification

14.3.5.1 Explanation

14.3.6 Evaluation

14.4 Results and Evaluation

14.5 Conclusion

References

15 SNAP: Social Network Analysis Using Predictive Modeling

Samridhi Seth and Rahul Johari

15.1 Introduction

15.1.1 Types of Predictive Analytics Models

15.1.2 Predictive Analytics Techniques

15.1.2.1 Regression Techniques

15.1.2.2 Machine Learning Techniques

15.2 Literature Survey

15.3 Comparative Study

15.4 Simulation and Analysis

15.4.1 Few Analyses Made on the Data Set Are Given Below

15.4.1.1 Duration of Each Contact Was Found

15.4.1.2 Total Number of Contacts of Source Node with Destination Node Was Found for all Nodes

15.4.1.3 Total Duration of Contact of Source Node with Each Node Was Found

15.4.1.4 Mobility Pattern Describes Direction of Contact and Relation Between Number of Contacts and Duration of Contact

15.4.1.5 Unidirectional Contact, that is, Only 1 Node is Contacting Second Node but Vice Versa Is Not There

15.4.1.6 Graphical Representation for the Duration of Contacts with Each Node is Given below

15.4.1.7 Rank and Percentile for Number of Contacts with Each Node

15.4.1.8 Data Set Is Described for Three Days Where Time Is Calculated in Seconds. Data Set can be Divided Into Three Days. Some of the Analyses Conducted on the Dataset Day Wise Are Given Below

15.5 Conclusion and Future Work

References

16 Intelligent Data Analysis for Medical Applications

Moolchand Sharma, Vikas Chaudhary, Prerna Sharma, and R.S. Bhatia

16.1 Introduction

16.1.1 IDA (Intelligent Data Analysis)

16.1.1.1 Elicitation of Background Knowledge

16.1.2 Medical Applications

16.2 IDA Needs in Medical Applications

16.2.1 Public Health

16.2.2 Electronic Health Record

16.2.3 Patient Profile Analytics

16.2.3.1 Patient’s Profile

16.3 IDA Methods Classifications

16.3.1 Data Abstraction

16.3.2 Data Mining Method

16.3.3 Temporal Data Mining

16.4 Intelligent Decision Support System in Medical Applications

16.4.1 Need for Intelligent Decision System (IDS)

16.4.2 Understanding Intelligent Decision Support: Some Definitions

16.4.3 Advantages/Disadvantages of IDS

16.5 Conclusion

References

17 Bruxism Detection Using Single-Channel C4-A1 on Human Sleep S2 Stage Recording

Md Belal Bin Heyat, Dakun Lai, Faijan Akhtar, Mohd Ammar Bin Hayat, Shafan Azad, Shadab Azad, and Shajan Azad

17.1 Introduction

17.1.1 Side Effect of Poor Snooze

17.2 History of Sleep Disorder

17.2.1 Classification of Sleep Disorder

17.2.2 Sleep Stages of the Human

17.3 Electroencephalogram Signal

17.3.1 Electroencephalogram Generation

17.3.1.1 Classification of Electroencephalogram Signal

17.4 EEG Data Measurement Technique

17.4.1 10–20 Electrode Positioning System

17.4.1.1 Procedure of Electrode placement

17.5 Literature Review

17.6 Subjects and Methodology

17.6.1 Data Collection

17.6.2 Low Pass Filter

17.6.3 Hanning Window

17.6.4 Welch Method

17.7 Data Analysis of the Bruxism and Normal Data Using EEG Signal

17.8 Result

17.9 Conclusions

Acknowledgments

References

18 Handwriting Analysis for Early Detection of Alzheimer’s Disease

Rajib Saha, Anirban Mukherjee, Aniruddha Sadhukhan, Anisha Roy, and Manashi De

18.1 Introduction and Background

18.2 Proposed Work and Methodology

18.3 Results and Discussions

18.3.1 Character Segmentation

18.4 Conclusion

References

Index

List of Contributors

Ambarish G. Mohapatra
Silicon Institute of Technology
Bhubaneswar
India

Anirban Mukherjee
RCC Institute of Information Technology
West Bengal
India

Aniruddha Sadhukhan
RCC Institute of Information Technology
West Bengal
India

Anisha Roy
RCC Institute of Information Technology
West Bengal
India

Arvinder Kaur
Guru Gobind Singh Indraprastha University
India

Ayush Ahuja
Jaypee Institute of Information Technology
Noida
India

Biswajit Modak
Nabadwip State General Hospital
Nabadwip
India

R.S. Bhatia
National Institute of Technology
Kurukshetra
India

Bright Keswani
Suresh Gyan Vihar University
Jaipur
India

Dakun Lai
University of Electronic Science and Technology of China
Chengdu
China

Deepak Kumar Sharma
Netaji Subhas University of Technology
New Delhi
India

Dhanushka Abeyratne
Yellowfin (HQ)
The University of Melbourne
Australia

Faijan Akhtar
Jamia Hamdard
New Delhi
India

Gihan S. Pathirana
Charles Sturt University
Melbourne
Australia

Huy V. Pham
Ton Duc Thang University
Vietnam

Malka N. Halgamuge
The University of Melbourne
Australia

Manashi De
Techno India
West Bengal
India

Manik Sharma
DAV University
Jalandhar
India

Manu Agarwal
Jaypee Institute of Information Technology
Noida
India

Manu Sood
University
Shimla
India

Md Belal Bin Heyat
University of Electronic Science and Technology of China
Chengdu
China

Mohd Ammar Bin Hayat
Medical University
India

Moolchand Sharma
Maharaja Agrasen Institute of Technology (MAIT)
Delhi
India

Nabendu Chaki
University of Calcutta
Kolkata
India

Nisheeth Joshi
Banasthali Vidyapith
Rajasthan
India

Om Prakash Rishi
University of Kota
India

Poonam Keswani
Akashdeep PG College
Jaipur
India

Prableen Kaur
DAV University
Jalandhar
India

Pragya Katyayan
Banasthali Vidyapith
Rajasthan
India

Pratiyush Guleria
University
Shimla
India

Prerna Sharma
Maharaja Agrasen Institute of Technology (MAIT)
Delhi
India

Rachna Jain
Bharati Vidyapeeth’s College of Engineering
New Delhi
India

Rahul Johari
GGSIP University
New Delhi
India

Rajib Saha
RCC Institute of Information Technology
West Bengal
India

Rakesh Roshan
Institute of Management Studies
Ghaziabad
India

Ramneek Singhal
Bharati Vidyapeeth’s College of Engineering
New Delhi
India

Ravinder Ahuja
Jaypee Institute of Information Technology
Noida
India

Samarth Chugh
Netaji Subhas University of Technology
New Delhi
India

Samridhi Seth
GGSIP University
New Delhi
India

Sarthak Gupta
Netaji Subhas University of Technology
New Delhi
India

Shadab Azad
Chaudhary Charan Singh University
Meerut
India

Shafan Azad
Dr. A.P.J. Abdul Kalam Technical University
Uttar Pradesh
India

Shajan Azad
Hayat Institute of Nursing
Lucknow
India

Shikhar Asthana
Jaypee Institute of Information Technology
Noida
India

Shivam Bachhety
Bharati Vidyapeeth’s College of Engineering
New Delhi
India

Shubham Kumaram
Netaji Subhas University of Technology
New Delhi
India

Shubhra Goyal
Guru Gobind Singh Indraprastha University
India

Siddhant Bagga
Netaji Subhas University of Technology
New Delhi
India

Soma Datta
University of Calcutta
Kolkata
India

Tarini Ch. Mishra
Silicon Institute of Technology
Bhubaneswar
India

Than D. Le
University of Bordeaux
France

Vikas Chaudhary
KIET
Ghaziabad
India

Series Preface

Dr. Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, India (Series Editor)

The Intelligent Signal and Data Processing (ISDP) book series is aimed at fostering the field of signal and data processing, which encompasses the theory and practice of algorithms and hardware that convert signals produced by artificial or natural means into a form useful for a specific purpose. The signals might be speech, audio, images, video, sensor data, telemetry, electrocardiograms, or seismic data, among others. The possible application areas include transmission, display, storage, interpretation, classification, segmentation, or diagnosis. The primary objective of the ISDP book series is to evolve future-generation scalable intelligent systems for faithful analysis of signals and data. ISDP is mainly intended to enrich the scholarly discourse on intelligent signal and image processing in different incarnations. ISDP will benefit a wide range of learners, including students, researchers, and practitioners. The student community can use the volumes in the series as reference texts to advance their knowledge base. In addition, the monographs will also come in handy to the aspiring researcher because of the valuable contributions both have made in this field. Moreover, both faculty members and data practitioners are likely to grasp the depth of the relevant knowledge base from these volumes.

The series coverage will contain, not exclusively, the following:

1. Intelligent signal processing

a) Adaptive filtering

b) Learning algorithms for neural networks

c) Hybrid soft-computing techniques

d) Spectrum estimation and modeling

2. Image processing

a) Image thresholding

b) Image restoration

c) Image compression

d) Image segmentation

e) Image quality evaluation

f) Computer vision and medical imaging

g) Image mining

h) Pattern recognition

i) Remote sensing imagery

j) Underwater image analysis

k) Gesture analysis

l) Human mind analysis

m) Multidimensional image analysis

3. Speech processing

a) Modeling

b) Compression

c) Speech recognition and analysis

4. Video processing

a) Video compression

b) Analysis and processing

c) 3D video compression

d) Target tracking

e) Video surveillance

f) Automated and distributed crowd analytics

g) Stereo-to-autostereoscopic 3D video conversion

h) Virtual and augmented reality

5. Data analysis

a) Intelligent data acquisition

b) Data mining

c) Exploratory data analysis

d) Modeling and algorithms

e) Big data analytics

f) Business intelligence

g) Smart cities and smart buildings

h) Multiway data analysis

i) Predictive analytics

j) Intelligent systems

Preface

Intelligent data analysis (IDA), knowledge discovery, and decision support have recently become more challenging research fields and have gained much attention among a large number of researchers and practitioners. In our view, the awareness of these challenging research fields and emerging technologies among the research community will increase the applications in biomedical science. This book aims to present the various approaches, techniques, and methods that are available for IDA, and to present case studies of their application.

This volume comprises 18 chapters focusing on the latest advances in IDA tools and techniques.

Machine learning models are broadly categorized into two types: white box and black box. Due to the difficulty in interpreting their inner workings, some machine learning models are considered black box models. Chapter 1 focuses on the different machine learning models, along with their advantages and limitations as far as the analysis of data is concerned.

With the advancement of technology, the amount of data generated is very large. The data generated has useful information that needs to be gathered by data analytics tools in order to make better decisions. In Chapter 2, the definition of data and its classifications based on different factors is given. The reader will learn about how and what data is and about the breakup of the data. After a description of what data is, the chapter will focus on defining and explaining big data and the various challenges faced by dealing with big data. The authors also describe various types of analytics that can be performed on large data and six data analytics tools (Microsoft Excel, Apache Spark, OpenRefine, R, Hadoop, and Tableau).

In recent years, the widespread use of computers and the internet has led to the generation of data on an unprecedented scale. To make effective use of this data, it must be collected and analyzed so that inferences can be made to improve various products and services. Statistics deals with the collection, organization, and analysis of data. In Chapter 3, the organization and description of data is studied under descriptive statistics, while analysis of data and how to make predictions based on it is dealt with in inferential statistics.

After having an idea about various aspects of IDA in the previous chapters, Chapter 4 deals with an overview of data mining. It also discusses the process of knowledge discovery in data along with a detailed analysis of various mining methods including classification, clustering, and decision tree. In addition to that, the chapter concludes with a view of data visualization and probability concepts for IDA.

In Chapter 5, the authors demonstrate one of the most crucial and challenging areas in computer vision and the IDA field based on manipulating the convergence. This subject is divided into a deep learning paradigm for object segmentation in computer vision and a visualization paradigm for efficiently incremental interpretation in manipulating the datasets for supervised and unsupervised learning, and online or offline training in reinforcement learning. This topic recently has had a large impact in robotics and autonomous systems, food detection, recommendation systems, and medical applications.

Dental caries is a painful bacterial disease of teeth caused mainly by Streptococcus mutans, acid, and carbohydrates, and it destroys the enamel, or the dentine, layer of the tooth. As per the World Health Organization report, worldwide, 60–90% of school children and almost 100% of adults have dental caries. Dental caries and periodontal disease without treatment for long periods cause tooth loss. There is not a single method to detect caries in its earliest stages. The size of carious lesions and early caries detection are very challenging tasks for dental practitioners. The methods related to dental caries detection are the radiograph, QLF (quantitative light-induced fluorescence), ECM, FOTI, DIFOTI, etc. In a radiograph-based technique, dentists analyze the image data. In Chapter 6, the authors present a method to detect caries by analyzing the secondary emission data.

With the growth of data in the education field in recent years, there is a need for intelligent data analytics so that academic data can be used effectively to improve learning. Educational data mining and learning analytics are the fields of IDA that play important roles in the intelligent analysis of educational data. One of the real challenges faced by students and institutions alike is the quality of education. An equally important factor related to the quality of education is the performance of students in the higher education system. The decisions that students make while selecting their area of specialization are of grave concern here. In the absence of support systems, the students and the teachers/mentors fall short when making the right decisions for the furthering of their chosen career paths. Therefore, in Chapter 7, the authors attempt to address the issue by proposing a system that can guide the student to choose and to focus on the right course(s) based on their personal preferences. For this purpose, a system has been envisaged by blending data mining and classification with big data. A methodology using the MapReduce framework and association rule mining is proposed in order to derive the right blend of courses for students to pursue to enhance their career prospects.

Atmospheric air pollution is creating significant health problems that affect millions of people around the world. Chapter 8 examines the hypothesis that global green space variation is changing global air quality. The authors perform a big data analysis with a dataset that contains more than 1M (1,048,000) green space and air quality data points covering 190 countries during the years 1990 to 2015. Air quality is measured by considering the particulate matter (PM) value. The analysis is carried out using multivariate graphs and a k-means clustering algorithm. The relative geographical changes of the tree areas, as well as the level of the air quality, were identified, and the results were encouraging.
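As a rough illustration of the k-means clustering mentioned above, the following minimal sketch groups a handful of one-dimensional readings. It is not the chapter authors' analysis: the function name `kmeans_1d`, the starting centroids, and the PM values are all hypothetical and chosen only to make the two steps of the algorithm visible.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm in one dimension, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical PM readings (not taken from the chapter's dataset).
pm_values = [8.0, 10.0, 12.0, 40.0, 42.0, 44.0]
centroids, clusters = kmeans_1d(pm_values, centroids=[8.0, 44.0])
print(centroids)
print(clusters)
```

The starting centroids are passed in explicitly so that the result is repeatable; a practical implementation would usually choose them at random and run the algorithm several times.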

Space technology and geotechnology, such as geographic information systems, play a vital role in the day-to-day activities of a society. In the initial days, data collection was very rudimentary and primitive. The quality of the data collected was a subject of verification and the accuracy of the data was also questionable. With the advent of newer technology, these problems have been overcome. Modern sophisticated systems have changed space science drastically. Implementing cutting-edge spaceborne sensors has made it possible to capture real-time data from space. Chapter 9 focuses on these aspects in detail.

Transportation plays an important role in our overall economy, conveying products and people through increasingly complex, interconnected, and multidimensional transportation frameworks. However, the complexities of present-day transportation cannot be managed by earlier systems. The utilization of IDA frameworks and strategies, with effective information gathering and data dissemination frameworks, provides the opportunities required to build the future intelligent transportation systems (ITSs). In Chapter 10, the authors exhibit the application of IDA in IoT-based ITS.

Chapter 11 aims to observe emerging patterns and trends by using big data analysis to enhance predictions of motor vehicle collisions, using a dataset consisting of 17 attributes and 998,193 collisions in New York City. The data is extracted from the New York City Police Department (NYPD). The dataset has then been tested with three classification algorithms: k-nearest neighbor, random forest, and naive Bayes. The outputs are captured using the k-fold cross-validation method. These outputs are used to identify and compare classifier accuracy, as well as random forest node accuracy and processing time. Further, an analysis of raw data is performed describing the four different vehicle groups in order to detect significance within the recorded period. Finally, extreme cases of collision severity are identified using outlier analysis. The analysis demonstrates that, of the three classifiers, random forest gives the best results.
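To make the k-fold cross-validation step concrete, here is a minimal sketch of how a dataset can be split into k folds and a classifier scored on each held-out fold. It is an illustration under stated assumptions rather than the chapter's code: the helper names are hypothetical, and a trivial majority-class baseline stands in for the k-nearest neighbor, random forest, and naive Bayes classifiers named above.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def majority_fit(X, T):
    """Stand-in classifier (an assumption): remember the most frequent label."""
    return max(sorted(set(T)), key=T.count)


def majority_predict(model, x):
    """Predict the remembered label regardless of the input."""
    return model


def cross_validate(X, T, k, fit, predict):
    """Average accuracy over k folds; each fold is held out exactly once."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [T[i] for i in train])
        correct = sum(predict(model, X[i]) == T[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / k


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("test fold:", test)
```

In practice the rows would normally be shuffled before splitting, and any function pair with the same `fit`/`predict` shape can be passed to `cross_validate` to compare classifiers on equal terms.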

Neurological disorders are diseases related to the brain, nervous system, and spinal cord of the human body. These disorders may affect the walking, speaking, learning, and moving capacity of human beings. Some of the major human neurological disorders are stroke, brain tumors, epilepsy, meningitis, Alzheimer’s, etc. Additionally, remarkable growth has been observed in the areas of disease diagnosis and health informatics. Critical human disorders related to the lung, kidney, skin, and brain have been successfully diagnosed using different data mining and machine learning techniques. In Chapter 12, several neurological and psychological disorders are discussed. The role of different computing techniques in designing different biomedical applications is presented. In addition, the challenges and promising areas of innovation in designing a smart and intelligent neurological disorder diagnostic system using big data, internet of things, and emerging computing techniques are also highlighted.

Bug reports are one of the crucial software artifacts in open-source software. Issue tracking systems maintain enormous numbers of bug reports with several attributes, such as long descriptions of bugs, threaded discussion comments, and bug meta-data, which includes Bug ID, priority, status, resolution, time, and others. In Chapter 13, bug reports of 20 open-source projects of the Apache Software Foundation are extracted using a tool named the Bug Report Collection System for trend analysis. As per the quantitative analysis of data, about 20% of open bugs are critical in nature, which directly impacts the functioning of the system. The presence of a large number of bugs of this kind can put systems into vulnerable positions and reduce their risk aversion capability. Thus, it is essential to resolve these issues with high priority. The test lead can assign these issues to the most contributing developers of a project for quick closure of opened critical bugs. The comments are mined to identify the developers resolving the majority of bugs, which is beneficial for test leads of distinct projects. As per the collated data, the areas more prone to system failures are determined, such as input/output type errors and logical code errors.

Sentiments are the standard way by which people express their feelings. Sentiments are broadly classified as positive and negative. A problem arises when users express themselves with words that differ from their actual feelings. This phenomenon is generally known to us as sarcasm, where people say something opposite to their actual sentiments. Sarcasm detection is of great importance for the correct analysis of sentiments. Chapter 14 attempts to give an algorithm for successful detection of hyperbolic sarcasm and general sarcasm in a dataset of sarcastic posts that are collected from pages dedicated to sarcasm on social media sites such as Facebook, Pinterest, and Instagram. This chapter also shows the initial results of the algorithm and its evaluation.

Predictive analytics refers to forecasting future probabilities by extracting information from existing datasets and determining patterns from predicted outcomes. Predictive analytics also includes what-if scenarios and risk assessment. In Chapter 15, an effort has been made to use principles of predictive modeling to analyze an authentic social network dataset, and the results have been encouraging. The post-analysis of the results has focused on exhibiting contact details, mobility pattern, and a number of degree of connections/minutes, leading to identification of the linkage/bonding between the nodes in the social network.

Modern medicine faces the major challenge of realizing the promise and capacity of the tremendous expansion in medical datasets of all kinds. Medical databases accumulate a huge bulk of knowledge and data, which calls for specialized tools to store and analyze the data and, as a result, to use the saved knowledge and data effectively. Information is extracted from data by using a domain’s background knowledge in the process of IDA. The matters dealt with concern the use, definition, and impact of these processes, which are tested for their optimization in application domains of medicine. The primary focus of Chapter 16 is on the methods and tools of IDA, with an aim to minimize the growing gap between data gathering and data comprehension.

Snoozing, or sleeping, is a physical phenomenon of human life. When human snooze is disturbed, it causes many problems, such as mental disease, heart disease, etc. Total snooze is characterized by two stages, viz., rapid eye movement and non-rapid eye movement. Bruxism is a type of snooze disorder. The traditional method of prognosis takes time and the result is in analog form. Chapter 17 proposes a method for easy prognosis of snooze bruxism.

Neurodegenerative diseases like Alzheimer’s and Parkinson’s impair the cognitive and motor abilities of the patient and cause memory loss and confusion. As handwriting involves proper functioning of the brain and motor control, it is affected. Alteration in handwriting is one of the first signs of Alzheimer’s disease. The handwriting gets shaky due to loss of muscle control, confusion, and forgetfulness. The symptoms get progressively worse. The handwriting becomes illegible and phonological spelling mistakes become inevitable. In Chapter 18, the authors use a feature extraction technique to be used as a parameter for diagnosis. A variational autoencoder (VAE), a deep unsupervised learning technique, has been applied, which is used to compress the input data and then reconstruct it, keeping the targeted output the same as the targeted input.

This edited volume on IDA gathers researchers, scientists, and practitioners interested in computational data analysis methods, aimed at narrowing the gap between extensive amounts of data stored in medical databases and the interpretation, understanding, and effective use of the stored data. The expected readers of this book are researchers, scientists, and practitioners interested in IDA, knowledge discovery, and decision support in databases, particularly those who are interested in using these technologies. This publication provides useful references for educational institutions, industry, academic researchers, professionals, developers, and practitioners to apply, evaluate, and reproduce the contributions to this book.

May 07, 2019

New Delhi, India    Deepak Gupta
Bengaluru, India    Siddhartha Bhattacharyya
New Delhi, India    Ashish Khanna
Uttar Pradesh, India    Kalpna Sagar

Intelligent Data Analysis: Black Box Versus White Box Modeling

Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma
Division of Information Technology, Netaji Subhas University of Technology, New Delhi, India

1.1 Introduction

In the midst of all of the societal challenges of today’s world, digital transformation is rapidly becoming a necessity. The number of internet users is growing at an unprecedented rate. New devices, sensors, and technologies are emerging every day. These factors have led to an exponential increase in the volume of data being generated. According to recent research [1], users of the internet generate 2.5 quintillion bytes of data per day.

1.1.1 Intelligent Data Analysis

Data is only as good as what you make of it. The sheer amount of data being generated calls for methods to leverage its power. With the proper tools and methodologies, data analysis can improve decision making, lower the risks, and unearth hidden insights. Intelligent data analysis (IDA) is concerned with effective analysis of data [2, 3].

The process of IDA consists of three main steps (see Figure 1.1):

1. Data collection and preparation: This step involves acquiring data and converting it into a format suitable for further analysis. This may involve storing the data as a table, taking care of empty or null values, etc.

2. Exploration: Before a thorough analysis can be performed on the data, certain characteristics are examined, such as the number of data points, the included variables, statistical features, etc. Data exploration allows analysts to get familiar with the dataset and create prospective hypotheses. Visualization is extensively used in this step. Various visualization techniques will be discussed in depth later in this chapter.

3. Analysis: Various machine learning and deep learning algorithms are applied at this step. Data analysts build models that try to find the best possible fit to the data points. These models can be classified as white box or black box models.
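The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the records, the column meanings, and the function names `prepare`, `explore`, and `fit_line` are hypothetical, and an ordinary least-squares line stands in for whatever model an analyst would actually choose in the analysis step.

```python
from statistics import mean, stdev

# Hypothetical raw records: (hours_studied, exam_score); None marks a missing value.
raw = [(1.0, 52.0), (2.0, 55.0), (None, 60.0), (3.0, 61.0), (4.0, None), (5.0, 68.0)]


def prepare(rows):
    """Step 1 (preparation): keep only rows without missing values."""
    return [r for r in rows if all(v is not None for v in r)]


def explore(rows):
    """Step 2 (exploration): number of data points and basic statistics."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    return {"n": len(rows),
            "x_mean": mean(xs), "x_std": stdev(xs),
            "y_mean": mean(ys), "y_std": stdev(ys)}


def fit_line(rows):
    """Step 3 (analysis): least-squares fit of y = a + b * x."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b


clean = prepare(raw)
summary = explore(clean)
a, b = fit_line(clean)
print(summary)
print(f"score ~ {a:.2f} + {b:.2f} * hours")
```

Dropping incomplete rows is the simplest way of taking care of null values; imputing them is another common choice, and which is appropriate depends on the dataset.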

A more comprehensive introduction to data analysis can be found in the literature [4–6].

Intelligent Data Analysis: From Data Gathering to Data Comprehension, First Edition. Edited by Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar. © 2020 John Wiley & Sons Ltd. Published 2020 by John Wiley & Sons Ltd.

1.1.2 Applications of IDA and Machine Learning

IDA and machine learning can be applied to a multitude of products and services, since these models have the ability to make fast, data-driven decisions at scale. We’re surrounded by live examples of machine learning in things we use in day-to-day life.

A primary example is web page ranking [7, 8]. Whenever we search for anything on a search engine, the results that we get are presented to us in the order of relevance. To achieve this, the search engine needs to “know” which pages are more relevant than others.

A related application is collaborative filtering [9, 10]. Collaborative filtering filters information based on recommendations of other people. It is based on the premise that people who agreed in their evaluation of certain items in the past are likely to agree again in the future.
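The premise behind collaborative filtering can be made concrete with a small user-based sketch. The ratings, user names, and the choice of cosine similarity below are assumptions made for illustration; they are not taken from the cited references, and real systems use many variants of this idea.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "C": 1, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of the ratings other users gave the item."""
    weighted_sum = total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = similarity(user, other)
        weighted_sum += weight * their_ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


print(similarity("ann", "bob"), similarity("ann", "cat"))
print(predict("ann", "D"))
```

Because "ann" agreed with "bob" far more than with "cat" on the items all three rated, the predicted rating for item "D" is pulled toward bob's rating, which is exactly the premise stated above.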

Another application is automatic translation of documents from one language to another. Manually doing this is an extremely arduous task and would take a significant amount of time.

IDA and machine learning models are also being used for many other tasks [11, 12] like object classification, named entity recognition, object localization, stock price prediction, etc.

1.1.3 White Box Models Versus Black Box Models

IDA aims to analyze the data to create predictive models. Suppose that we’re given a dataset D(X, T), where X represents inputs and T represents target values (i.e., known correct values with respect to the input). The goal is to learn a function (or map) from inputs (X) to outputs (T). This is done by employing supervised machine learning algorithms [13]. A model refers to the artifact that is created by the training (or learning) process. Models are broadly categorized into two types:

1. White box models: The models whose predictions are easily explainable are called white box models. These models are typically simple and, as a result, often less accurate than more complex models. Examples include simple decision trees, linear regression, logistic regression, etc.

2. Black box models: The models whose predictions are difficult to interpret or explain are called black box models. They are difficult to interpret because of their complexity. Since they are complex models, their accuracy is usually high. Examples include large decision trees, random forests, neural networks, etc.

So, IDA and machine learning models suffer from an accuracy-explainability trade-off. However, with advances in IDA, the explainability gap in black box models is narrowing.
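The difference in explainability can be seen in a small sketch. Here a single decision stump (a one-rule decision tree) plays the role of a white box model, and a bagged ensemble of 25 stumps serves as a simplified stand-in for a random forest. The data, function names, and ensemble size are hypothetical, and the toy data is too easy to show any accuracy difference; the point is only how much has to be read to explain one prediction.

```python
import random

# Hypothetical toy dataset D(X, T): two input features and a binary target.
X = [(1.0, 5.0), (2.0, 4.0), (3.0, 6.0), (6.0, 5.0), (7.0, 4.0), (8.0, 6.0)]
T = [0, 0, 0, 1, 1, 1]


def fit_stump(X, T):
    """Return (feature, threshold, left_label, right_label) with fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if x[f] <= thr else right) != t for x, t in zip(X, T))
                if best is None or errors < best[0]:
                    best = (errors, f, thr, left, right)
    return best[1:]


def predict_stump(stump, x):
    f, thr, left, right = stump
    return left if x[f] <= thr else right


def fit_ensemble(X, T, n_models=25, seed=0):
    """Bagging: fit one stump per bootstrap resample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in X]
        models.append(fit_stump([X[i] for i in idx], [T[i] for i in idx]))
    return models


def predict_ensemble(models, x):
    """Majority vote over all stumps in the ensemble."""
    votes = sum(predict_stump(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0


stump = fit_stump(X, T)
f, thr, left, right = stump
# White box: the whole model is one sentence.
print(f"if feature_{f} <= {thr}: predict {left}, else predict {right}")

ensemble = fit_ensemble(X, T)
# Less transparent: a prediction is a vote over many rules, not one readable rule.
print(len(ensemble), "stumps vote; prediction for (7.5, 5.0):",
      predict_ensemble(ensemble, (7.5, 5.0)))
```

Even in this tiny case, explaining an ensemble prediction means inspecting every voting rule, whereas the single stump can be stated in one line.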

Figure 1.1 Data analysis process.
