Intelligent Data Analysis: From Data Gathering to Data Comprehension Deepak Gupta
Intelligent Data Analysis
From Data Gathering to Data Comprehension

Edited by

Deepak Gupta
Maharaja Agrasen Institute of Technology, Delhi, India

Siddhartha Bhattacharyya
CHRIST (Deemed to be University), Bengaluru, India

Ashish Khanna
Maharaja Agrasen Institute of Technology, Delhi, India

Kalpna Sagar
KIET Group of Institutions, Uttar Pradesh, India
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar to be identified as the authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This work’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Gupta, Deepak, editor.
Title: Intelligent data analysis : from data gathering to data comprehension / edited by Dr. Deepak Gupta, Dr. Siddhartha Bhattacharyya, Dr. Ashish Khanna, Ms. Kalpna Sagar.
Description: Hoboken, NJ, USA : Wiley, 2020. | Series: The Wiley series in intelligent signal and data processing | Includes bibliographical references and index.
Identifiers: LCCN 2019056735 (print) | LCCN 2019056736 (ebook) | ISBN 9781119544456 (hardback) | ISBN 9781119544449 (adobe pdf) | ISBN 9781119544463 (epub)
Subjects: LCSH: Data mining. | Computational intelligence.
Classification: LCC QA76.9.D343 I57435 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12–dc23
LC record available at https://lccn.loc.gov/2019056735
LC ebook record available at https://lccn.loc.gov/2019056736
Cover Design: Wiley
Cover Image: © gremlin/Getty Images
Set in 9.5/12.5pt STIXTwoText by SPi Global, Chennai, India
Deepak Gupta would like to dedicate this book to his father, Sh. R.K. Gupta, his mother, Smt. Geeta Gupta, his mentors for their constant encouragement, and his family members, including his wife, brothers, sisters, kids and the students.
Siddhartha Bhattacharyya would like to dedicate this book to his parents, the late Ajit Kumar Bhattacharyya and the late Hashi Bhattacharyya, his beloved wife, Rashni, and his research scholars, Sourav, Sandip, Hrishikesh, Pankaj, Debanjan, Alokananda, Koyel, and Tulika.
Ashish Khanna would like to dedicate this book to his parents, the late R.C. Khanna and Smt. Surekha Khanna, for their constant encouragement and support, and to his wife, Sheenu, and children, Master Bhavya and Master Sanyukt.
Kalpna Sagar would like to dedicate this book to her father, Mr. Lekh Ram Sagar, and her mother, Smt. Gomti Sagar, the strongest persons of her life.
Contents
List of Contributors
Series Preface
Preface
1 Intelligent Data Analysis: Black Box Versus White Box Modeling
Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma
1.1 Introduction
1.1.1 Intelligent Data Analysis
1.1.2 Applications of IDA and Machine Learning
1.1.3 White Box Models Versus Black Box Models
1.1.4 Model Interpretability
1.2 Interpretation of White Box Models
1.2.1 Linear Regression
1.2.2 Decision Tree
1.3 Interpretation of Black Box Models
1.3.1 Partial Dependence Plot
1.3.2 Individual Conditional Expectation
1.3.3 Accumulated Local Effects
1.3.4 Global Surrogate Models
1.3.5 Local Interpretable Model-Agnostic Explanations
1.3.6 Feature Importance
1.4 Issues and Further Challenges
1.5 Summary
References
2 Data: Its Nature and Modern Data Analytical Tools
Ravinder Ahuja, Shikhar Asthana, Ayush Ahuja, and Manu Agarwal
2.1 Introduction
2.2 Data Types and Various File Formats
2.2.1 Structured Data
2.2.2 Semi-Structured Data
2.2.3 Unstructured Data
2.2.4 Need for File Formats
2.2.5 Various Types of File Formats
2.2.5.1 Comma Separated Values (CSV)
2.2.5.2 ZIP
2.2.5.3 Plain Text (txt)
2.2.5.4 JSON
2.2.5.5 XML
2.2.5.6 Image Files
2.2.5.7 HTML
2.3 Overview of Big Data
2.3.1 Sources of Big Data
2.3.1.1 Media
2.3.1.2 The Web
2.3.1.3 Cloud
2.3.1.4 Internet of Things
2.3.1.5 Databases
2.3.1.6 Archives
2.3.2 Big Data Analytics
2.3.2.1 Descriptive Analytics
2.3.2.2 Predictive Analytics
2.3.2.3 Prescriptive Analytics
2.4 Data Analytics Phases
2.5 Data Analytical Tools
2.5.1 Microsoft Excel
2.5.2 Apache Spark
2.5.3 Open Refine
2.5.4 R Programming
2.5.4.1 Advantages of R
2.5.4.2 Disadvantages of R
2.5.5 Tableau
2.5.5.1 How Tableau Works
2.5.5.2 Tableau Feature
2.5.5.3 Advantages
2.5.5.4 Disadvantages
2.5.6 Hadoop
2.5.6.1 Basic Components of Hadoop
2.5.6.2 Benefits
2.6 Database Management System for Big Data Analytics
2.6.1 Hadoop Distributed File System
2.6.2 NoSql
2.6.2.1 Categories of NoSql
2.7 Challenges in Big Data Analytics
2.7.1 Storage of Data
2.7.2 Synchronization of Data
2.7.3 Security of Data
2.7.4 Fewer Professionals
2.8 Conclusion
References
3 Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts
Shubham Kumaram, Samarth Chugh, and Deepak Kumar Sharma
3.1 Introduction
3.2 Probability
3.2.1 Definitions
3.2.1.1 Random Experiments
3.2.1.2 Probability
3.2.1.3 Probability Axioms
3.2.1.4 Conditional Probability
3.2.1.5 Independence
3.2.1.6 Random Variable
3.2.1.7 Probability Distribution
3.2.1.8 Expectation
3.2.1.9 Variance and Standard Deviation
3.2.2 Bayes’ Rule
3.3 Descriptive Statistics
3.3.1 Picture Representation
3.3.1.1 Frequency Distribution
3.3.1.2 Simple Frequency Distribution
3.3.1.3 Grouped Frequency Distribution
3.3.1.4 Stem and Leaf Display
3.3.1.5 Histogram and Bar Chart
3.3.2 Measures of Central Tendency
3.3.2.1 Mean
3.3.2.2 Median
3.3.2.3 Mode
3.3.3 Measures of Variability
3.3.3.1 Range
3.3.3.2 Box Plot
3.3.3.3 Variance and Standard Deviation
3.3.4 Skewness and Kurtosis
3.4 Inferential Statistics
3.4.1 Frequentist Inference
3.4.1.1 Point Estimation
3.4.1.2 Interval Estimation
3.4.2 Hypothesis Testing
3.4.3 Statistical Significance
3.5 Statistical Methods
3.5.1 Regression
3.5.1.1 Linear Model
3.5.1.2 Nonlinear Models
3.5.1.3 Generalized Linear Models
3.5.1.4 Analysis of Variance
3.5.1.5 Multivariate Analysis of Variance
3.5.1.6 Log-Linear Models
3.5.1.7 Logistic Regression
3.5.1.8 Random Effects Model
3.5.1.9 Overdispersion
3.5.1.10 Hierarchical Models
3.5.2 Analysis of Survival Data
3.5.3 Principal Component Analysis
3.6 Errors
3.6.1 Error in Regression
3.6.2 Error in Classification
3.7 Conclusion
References
4 Intelligent Data Analysis with Data Mining: Theory and Applications
Shivam Bachhety, Ramneek Singhal, and Rachna Jain
Objective
4.1 Introduction to Data Mining
4.1.1 Importance of Intelligent Data Analytics in Business
4.1.2 Importance of Intelligent Data Analytics in Health Care
4.2 Data and Knowledge
4.3 Discovering Knowledge in Data Mining
4.3.1 Process Mining
4.3.2 Process of Knowledge Discovery
4.4 Data Analysis and Data Mining
4.5 Data Mining: Issues
4.6 Data Mining: Systems and Query Language
4.6.1 Data Mining Systems
4.6.2 Data Mining Query Language
4.7 Data Mining Methods
4.7.1 Classification
4.7.2 Cluster Analysis
4.7.3 Association
4.7.4 Decision Tree Induction
4.8 Data Exploration
4.9 Data Visualization
4.10 Probability Concepts for Intelligent Data Analysis (IDA)
Reference
5 Intelligent Data Analysis: Deep Learning and Visualization
Than D. Le and Huy V. Pham
5.1 Introduction
5.2 Deep Learning and Visualization
5.2.1 Linear and Logistic Regression and Visualization
5.2.2 CNN Architecture
5.2.2.1 Vanishing Gradient Problem
5.2.2.2 Convolutional Neural Networks (CNNs)
5.2.3 Reinforcement Learning
5.2.4 Inception and ResNet Networks
5.2.5 Softmax
5.3 Data Processing and Visualization
5.3.1 Regularization for Deep Learning and Visualization
5.3.1.1 Regularization for Linear Regression
5.4 Experiments and Results
5.4.1 Mask RCNN Based on Object Detection and Segmentation
5.4.2 Deep Matrix Factorization
5.4.2.1 Network Visualization
5.4.3 Deep Learning and Reinforcement Learning
5.5 Conclusion
References
6 A Systematic Review on the Evolution of Dental Caries Detection Methods and Its Significance in Data Analysis Perspective
Soma Datta, Nabendu Chaki, and Biswajit Modak
6.1 Introduction
6.1.1 Analysis of Dental Caries
6.2 Different Caries Lesion Detection Methods and Data Characterization
6.2.1 Point Detection Method
6.2.2 Visible Light Property Method
6.2.3 Radiographs
6.2.4 Light-Emitting Devices
6.2.5 Optical Coherent Tomography (OCT)
6.2.6 Software Tools
6.3 Technical Challenges with the Existing Methods
6.3.1 Challenges in Data Analysis Perspective
6.4 Result Analysis
6.5 Conclusion
Acknowledgment
References
7 Intelligent Data Analysis Using Hadoop Cluster–Inspired MapReduce Framework and Association Rule Mining on Educational Domain
Pratiyush Guleria and Manu Sood
7.1 Introduction
7.1.1 Research Areas of IDA
7.1.2 The Need for IDA in Education
7.2 Learning Analytics in Education
7.2.1 Role of Web-Enabled and Mobile Computing in Education
7.2.2 Benefits of Learning Analytics
7.2.3 Future Research Directions of IDA
7.3 Motivation
7.4 Literature Review
7.4.1 Association Rule Mining and Big Data
7.5 Intelligent Data Analytical Tools
7.6 Intelligent Data Analytics Using MapReduce Framework in an Educational Domain
7.6.1 Data Description
7.6.2 Objective
7.6.3 Proposed Methodology
7.6.3.1 Stage 1 MapReduce Algorithm
7.6.3.2 Stage 2 Apriori Algorithm
7.7 Results
7.8 Conclusion and Future Scope
References
8 Influence of Green Space on Global Air Quality Monitoring: Data Analysis Using K-Means Clustering Algorithm
Gihan S. Pathirana and Malka N. Halgamuge
8.1 Introduction
8.2 Material and Methods
8.2.1 Data Collection
8.2.2 Data Inclusion Criteria
8.2.3 Data Preprocessing
8.2.4 Data Analysis
8.3 Results
8.4 Quantitative Analysis
8.4.1 K-Means Clustering
8.4.2 Level of Difference of Green Area
8.5 Discussion
8.6 Conclusion
References
9 IDA with Space Technology and Geographic Information System
Bright Keswani, Tarini Ch. Mishra, Ambarish G. Mohapatra, Poonam Keswani, Priyatosh Sahu, and Anish Kumar Sarangi
9.1 Introduction
9.1.1 Real-Time in Space
9.1.2 Generating Programming Triggers
9.1.3 Analytical Architecture
9.1.4 Remote Sensing Big Data Acquisition Unit (RSDU)
9.1.5 Data Processing Unit
9.1.6 Data Analysis and Decision Unit
9.1.7 Analysis
9.1.8 Incorporating Machine Learning and Artificial Intelligence
9.1.8.1 Methodologies Applicable
9.1.8.2 Support Vector Machines (SVM) and Cross-Validation
9.1.8.3 Massively Parallel Computing and I/O
9.1.8.4 Data Architecture and Governance
9.1.9 Real-Time Spacecraft Detection
9.1.9.1 Active Phased Array
9.1.9.2 Relay Communication
9.1.9.3 Low-Latency Random Access
9.1.9.4 Channel Modeling and Prediction
9.2 Geospatial Techniques
9.2.1 The Big-GIS
9.2.2 Technologies Applied
9.2.2.1 Internet of Things and Sensor Web
9.2.2.2 Cloud Computing
9.2.2.3 Stream Processing
9.2.2.4 Big Data Analytics
9.2.2.5 Coordinated Observation
9.2.2.6 Big Geospatial Data Management
9.2.2.7 Parallel Geocomputation Framework
9.2.3 Data Collection Using GIS
9.2.3.1 NoSQL Databases
9.2.3.2 Parallel Processing
9.2.3.3 Knowledge Discovery and Intelligent Service
9.2.3.4 Data Analysis
9.3 Comparative Analysis
9.4 Conclusion
References
10 Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT
Rakesh Roshan and Om Prakash Rishi
10.1 Introduction to Intelligent Transportation System (ITS)
10.1.1 Working of Intelligent Transportation System
10.1.2 Services of Intelligent Transportation System
10.1.3 Advantages of Intelligent Transportation System
10.2 Issues and Challenges of Intelligent Transportation System (ITS)
10.2.1 Communication Technology Used Currently in ITS
10.2.2 Challenges in the Implementation of ITS
10.2.3 Opportunity for Popularity of Automated/Autonomous/Self-Driving Car or Vehicle
10.3 Intelligent Data Analysis Makes an IoT-Based Transportation System Intelligent
10.3.1 Introduction to Intelligent Data Analysis
10.3.2 How IDA Makes IoT-Based Transportation Systems Intelligent
10.3.2.1 Traffic Management Through IoT and Intelligent Data Analysis
10.3.2.2 Tracking of Multiple Vehicles
10.4 Intelligent Data Analysis for Security in Intelligent Transportation System
10.5 Tools to Support IDA in an Intelligent Transportation System
References
11 Applying Big Data Analytics on Motor Vehicle Collision Predictions in New York City
Dhanushka Abeyratne and Malka N. Halgamuge
11.1 Introduction
11.1.1 Overview of Big Data Analytics on Motor Vehicle Collision Predictions
11.2 Materials and Methods
11.2.1 Collection of Raw Data
11.2.2 Data Inclusion Criteria
11.2.3 Data Preprocessing
11.2.4 Data Analysis
11.3 Classification Algorithms and K-Fold Validation Using Data Set Obtained from NYPD (2012–2017)
11.3.1 Classification Algorithms
11.3.1.1 k-Fold Cross-Validation
11.3.2 Statistical Analysis
11.4 Results
11.4.1 Measured Processing Time and Accuracy of Each Classifier
11.4.2 Measured p-Value in Each Vehicle Group Using K-Means Clustering/One-Way ANOVA
11.4.3 Identified High Collision Concentration Locations of Each Vehicle Group
11.4.4 Measured Different Criteria for Further Analysis of NYPD Data Set (2012–2017)
11.5 Discussion
11.6 Conclusion
References
12 A Smart and Promising Neurological Disorder Diagnostic System: An Amalgamation of Big Data, IoT, and Emerging Computing Techniques
Prableen Kaur and Manik Sharma
12.1 Introduction
12.1.1 Difference Between Neurological and Psychological Disorders
12.2 Statistics of Neurological Disorders
12.3 Emerging Computing Techniques
12.3.1 Internet of Things
12.3.2 Big Data
12.3.3 Soft Computing Techniques
12.4 Related Works and Publication Trends of Articles
12.5 The Need for Neurological Disorders Diagnostic System
12.5.1 Design of Smart and Intelligent Neurological Disorders Diagnostic System
12.6 Conclusion
References
13 Comments-Based Analysis of a Bug Report Collection System and Its Applications
Arvinder Kaur and Shubhra Goyal
13.1 Introduction
13.2 Background
13.2.1 Issue Tracking System
13.2.2 Bug Report Statistics
13.3 Related Work
13.3.1 Data Extraction Process
13.3.2 Applications of Bug Report Comments
13.3.2.1 Bug Summarization
13.3.2.2 Emotion Mining
13.4 Data Collection Process
13.4.1 Steps of Data Extraction
13.4.2 Block Diagram for Data Extraction
13.4.3 Reports Generated
13.4.3.1 Bug Attribute Report
13.4.3.2 Long Description Report
13.4.3.3 Bug Comments Reports
13.4.3.4 Error Report
13.5 Analysis of Bug Reports
13.5.1 Research Question 1: Is the Performance of Software Affected by Open Bugs that are Critical in Nature?
13.5.2 Research Question 2: How Can Test Leads Improve the Performance of Software Systems?
13.5.3 Research Question 3: Which Are the Most Error-Prone Areas that Can Cause System Failure?
13.5.4 Research Question 4: Which Are the Most Frequent Words and Keywords to Predict Most Critical Bugs?
13.5.5 Research Question 5: What Is the Importance of Frequent Words Mined from Bug Reports?
13.6 Threats to Validity
13.7 Conclusion
References
14 Sarcasm Detection Algorithms Based on Sentiment Strength
Pragya Katyayan and Nisheeth Joshi
14.1 Introduction
14.2 Literature Survey
14.3 Experiment
14.3.1 Data Collection
14.3.2 Finding SentiStrengths
14.3.3 Proposed Algorithm
14.3.4 Explanation of the Algorithms
14.3.5 Classification
14.3.5.1 Explanation
14.3.6 Evaluation
14.4 Results and Evaluation
14.5 Conclusion
References
15 SNAP: Social Network Analysis Using Predictive Modeling
Samridhi Seth and Rahul Johari
15.1 Introduction
15.1.1 Types of Predictive Analytics Models
15.1.2 Predictive Analytics Techniques
15.1.2.1 Regression Techniques
15.1.2.2 Machine Learning Techniques
15.2 Literature Survey
15.3 Comparative Study
15.4 Simulation and Analysis
15.4.1 Few Analyses Made on the Data Set Are Given Below
15.4.1.1 Duration of Each Contact Was Found
15.4.1.2 Total Number of Contacts of Source Node with Destination Node Was Found for all Nodes
15.4.1.3 Total Duration of Contact of Source Node with Each Node Was Found
15.4.1.4 Mobility Pattern Describes Direction of Contact and Relation Between Number of Contacts and Duration of Contact
15.4.1.5 Unidirectional Contact, that is, Only 1 Node is Contacting Second Node but Vice Versa Is Not There
15.4.1.6 Graphical Representation for the Duration of Contacts with Each Node is Given below
15.4.1.7 Rank and Percentile for Number of Contacts with Each Node
15.4.1.8 Data Set Is Described for Three Days Where Time Is Calculated in Seconds. Data Set can be Divided Into Three Days. Some of the Analyses Conducted on the Dataset Day Wise Are Given Below
15.5 Conclusion and Future Work
References
16 Intelligent Data Analysis for Medical Applications
Moolchand Sharma, Vikas Chaudhary, Prerna Sharma, and R.S. Bhatia
16.1 Introduction
16.1.1 IDA (Intelligent Data Analysis)
16.1.1.1 Elicitation of Background Knowledge
16.1.2 Medical Applications
16.2 IDA Needs in Medical Applications
16.2.1 Public Health
16.2.2 Electronic Health Record
16.2.3 Patient Profile Analytics
16.2.3.1 Patient’s Profile
16.3 IDA Methods Classifications
16.3.1 Data Abstraction
16.3.2 Data Mining Method
16.3.3 Temporal Data Mining
16.4 Intelligent Decision Support System in Medical Applications
16.4.1 Need for Intelligent Decision System (IDS)
16.4.2 Understanding Intelligent Decision Support: Some Definitions
16.4.3 Advantages/Disadvantages of IDS
16.5 Conclusion
References
17 Bruxism Detection Using Single-Channel C4-A1 on Human Sleep S2 Stage Recording
Md Belal Bin Heyat, Dakun Lai, Faijan Akhtar, Mohd Ammar Bin Hayat, Shafan Azad, Shadab Azad, and Shajan Azad
17.1 Introduction
17.1.1 Side Effect of Poor Snooze
17.2 History of Sleep Disorder
17.2.1 Classification of Sleep Disorder
17.2.2 Sleep Stages of the Human
17.3 Electroencephalogram Signal
17.3.1 Electroencephalogram Generation
17.3.1.1 Classification of Electroencephalogram Signal
17.4 EEG Data Measurement Technique
17.4.1 10–20 Electrode Positioning System
17.4.1.1 Procedure of Electrode Placement
17.5 Literature Review
17.6 Subjects and Methodology
17.6.1 Data Collection
17.6.2 Low Pass Filter
17.6.3 Hanning Window
17.6.4 Welch Method
17.7 Data Analysis of the Bruxism and Normal Data Using EEG Signal
17.8 Result
17.9 Conclusions
Acknowledgments
References
18 Handwriting Analysis for Early Detection of Alzheimer’s Disease
Rajib Saha, Anirban Mukherjee, Aniruddha Sadhukhan, Anisha Roy, and Manashi De
18.1 Introduction and Background
18.2 Proposed Work and Methodology
18.3 Results and Discussions
18.3.1 Character Segmentation
18.4 Conclusion
References
Index
List of Contributors
Ambarish G. Mohapatra, Silicon Institute of Technology, Bhubaneswar, India
Anirban Mukherjee, RCC Institute of Information Technology, West Bengal, India
Aniruddha Sadhukhan, RCC Institute of Information Technology, West Bengal, India
Anisha Roy, RCC Institute of Information Technology, West Bengal, India
Arvinder Kaur, Guru Gobind Singh Indraprastha University, India
Ayush Ahuja, Jaypee Institute of Information Technology, Noida, India
Biswajit Modak, Nabadwip State General Hospital, Nabadwip, India
R.S. Bhatia, National Institute of Technology, Kurukshetra, India
Bright Keswani, Suresh Gyan Vihar University, Jaipur, India
Dakun Lai, University of Electronic Science and Technology of China, Chengdu, China
Deepak Kumar Sharma, Netaji Subhas University of Technology, New Delhi, India
Dhanushka Abeyratne, Yellowfin (HQ), The University of Melbourne, Australia
Faijan Akhtar, Jamia Hamdard, New Delhi, India
Gihan S. Pathirana, Charles Sturt University, Melbourne, Australia
Huy V. Pham, Ton Duc Thang University, Vietnam
Malka N. Halgamuge, The University of Melbourne, Australia
Manashi De, Techno India, West Bengal, India
Manik Sharma, DAV University, Jalandhar, India
Manu Agarwal, Jaypee Institute of Information Technology, Noida, India
Manu Sood, University Shimla, India
Md Belal Bin Heyat, University of Electronic Science and Technology of China, Chengdu, China
Mohd Ammar Bin Hayat, Medical University, India
Moolchand Sharma, Maharaja Agrasen Institute of Technology (MAIT), Delhi, India
Nabendu Chaki, University of Calcutta, Kolkata, India
Nisheeth Joshi, Banasthali Vidyapith, Rajasthan, India
Om Prakash Rishi, University of Kota, India
Poonam Keswani, Akashdeep PG College, Jaipur, India
Prableen Kaur, DAV University, Jalandhar, India
Pragya Katyayan, Banasthali Vidyapith, Rajasthan, India
Pratiyush Guleria, University Shimla, India
Prerna Sharma, Maharaja Agrasen Institute of Technology (MAIT), Delhi, India
Rachna Jain, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Rahul Johari, GGSIP University, New Delhi, India
Rajib Saha, RCC Institute of Information Technology, West Bengal, India
Rakesh Roshan, Institute of Management Studies, Ghaziabad, India
Ramneek Singhal, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Ravinder Ahuja, Jaypee Institute of Information Technology, Noida, India
Samarth Chugh, Netaji Subhas University of Technology, New Delhi, India
Samridhi Seth, GGSIP University, New Delhi, India
Sarthak Gupta, Netaji Subhas University of Technology, New Delhi, India
Shadab Azad, Chaudhary Charan Singh University, Meerut, India
Shafan Azad, Dr. A.P.J. Abdul Kalam Technical University, Uttar Pradesh, India
Shajan Azad, Hayat Institute of Nursing, Lucknow, India
Shikhar Asthana, Jaypee Institute of Information Technology, Noida, India
Shivam Bachhety, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Shubham Kumaram, Netaji Subhas University of Technology, New Delhi, India
Shubhra Goyal, Guru Gobind Singh Indraprastha University, India
Siddhant Bagga, Netaji Subhas University of Technology, New Delhi, India
Soma Datta, University of Calcutta, Kolkata, India
Tarini Ch. Mishra, Silicon Institute of Technology, Bhubaneswar, India
Than D. Le, University of Bordeaux, France
Vikas Chaudhary, KIET, Ghaziabad, India
Series Preface
Dr. Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, India (Series Editor)
The Intelligent Signal and Data Processing (ISDP) book series is aimed at fostering the field of signal and data processing, which encompasses the theory and practice of algorithms and hardware that convert signals produced by artificial or natural means into a form useful for a specific purpose. The signals might be speech, audio, images, video, sensor data, telemetry, electrocardiograms, or seismic data, among others. The possible application areas include transmission, display, storage, interpretation, classification, segmentation, or diagnosis. The primary objective of the ISDP book series is to evolve future-generation scalable intelligent systems for faithful analysis of signals and data. ISDP is mainly intended to enrich the scholarly discourse on intelligent signal and image processing in different incarnations. ISDP will benefit a wide range of learners, including students, researchers, and practitioners. The student community can use the volumes in the series as reference texts to advance their knowledge base. In addition, the monographs will come in handy for aspiring researchers because of the valuable contributions they make to this field. Moreover, both faculty members and data practitioners are likely to grasp the depth of the relevant knowledge base from these volumes.
The series coverage will contain, not exclusively, the following:
1. Intelligent signal processing
   a) Adaptive filtering
   b) Learning algorithms for neural networks
   c) Hybrid soft-computing techniques
   d) Spectrum estimation and modeling
2. Image processing
   a) Image thresholding
   b) Image restoration
   c) Image compression
   d) Image segmentation
   e) Image quality evaluation
   f) Computer vision and medical imaging
   g) Image mining
   h) Pattern recognition
   i) Remote sensing imagery
   j) Underwater image analysis
   k) Gesture analysis
   l) Human mind analysis
   m) Multidimensional image analysis
3. Speech processing
   a) Modeling
   b) Compression
   c) Speech recognition and analysis
4. Video processing
   a) Video compression
   b) Analysis and processing
   c) 3D video compression
   d) Target tracking
   e) Video surveillance
   f) Automated and distributed crowd analytics
   g) Stereo-to-auto stereoscopic 3D video conversion
   h) Virtual and augmented reality
5. Data analysis
   a) Intelligent data acquisition
   b) Data mining
   c) Exploratory data analysis
   d) Modeling and algorithms
   e) Big data analytics
   f) Business intelligence
   g) Smart cities and smart buildings
   h) Multiway data analysis
   i) Predictive analytics
   j) Intelligent systems
Preface
Intelligent data analysis (IDA), knowledge discovery, and decision support have recently become more challenging research fields and have gained much attention among a large number of researchers and practitioners. In our view, the awareness of these challenging research fields and emerging technologies among the research community will increase the applications in biomedical science. This book aims to present the various approaches, techniques, and methods that are available for IDA, and to present case studies of their application.
This volume comprises 18 chapters focusing on the latest advances in IDA tools and techniques.
Machine learning models are broadly categorized into two types: white box and black box. Due to the difficulty in interpreting their inner workings, some machine learning models are considered black box models. Chapter 1 focuses on the different machine learning models, along with their advantages and limitations as far as the analysis of data is concerned.
With the advancement of technology, the amount of data generated is very large. This data contains useful information that data analytics tools must extract in order to support better decisions. Chapter 2 defines data and classifies it according to different factors. The reader will learn what data is and how it breaks down into types. The chapter then defines and explains big data and the various challenges of dealing with it. The authors also describe various types of analytics that can be performed on large data and six data analytics tools (Microsoft Excel, Apache Spark, OpenRefine, R, Hadoop, and Tableau).
In recent years, the widespread use of computers and the internet has led to the generation of data on an unprecedented scale. To make effective use of this data, it must be collected and analyzed so that inferences can be made to improve various products and services. Statistics deals with the collection, organization, and analysis of data. In Chapter 3, the organization and description of data are studied under descriptive statistics, while the analysis of data and how to make predictions based on it are dealt with under inferential statistics.
With various aspects of IDA introduced in the previous chapters, Chapter 4 gives an overview of data mining. It also discusses the process of knowledge discovery in data, along with a detailed analysis of various mining methods including classification, clustering, and decision trees. The chapter concludes with a view of data visualization and probability concepts for IDA.
In Chapter 5, the authors demonstrate one of the most crucial and challenging areas in computer vision and the IDA field, based on manipulating the convergence. The subject is divided into a deep learning paradigm for object segmentation in computer vision and a visualization paradigm for efficient incremental interpretation when manipulating datasets for supervised and unsupervised learning, and for online or offline training in reinforcement learning. This topic has recently had a large impact in robotics and autonomous systems, food detection, recommendation systems, and medical applications.
Dental caries is a painful bacterial disease of the teeth caused mainly by Streptococcus mutans, acid, and carbohydrates, and it destroys the enamel, or the dentine, layer of the tooth. As per the World Health Organization report, worldwide, 60–90% of school children and almost 100% of adults have dental caries. Dental caries and periodontal disease left untreated for long periods cause tooth loss. There is no single method to detect caries in its earliest stages. Estimating the size of carious lesions and detecting caries early are very challenging tasks for dental practitioners. The methods related to dental caries detection include the radiograph, QLF (quantitative light-induced fluorescence), ECM, FOTI, DIFOTI, etc. In a radiograph-based technique, dentists analyze the image data. In Chapter 6, the authors present a method to detect caries by analyzing the secondary emission data.
With the growth of data in the education field in recent years, there is a need for intelligent data analytics so that academic data can be used effectively to improve learning. Educational data mining and learning analytics are the fields of IDA that play important roles in the intelligent analysis of educational data. One of the real challenges faced by students and institutions alike is the quality of education. An equally important factor related to the quality of education is the performance of students in the higher education system. The decisions that students make while selecting their area of specialization are of grave concern here. In the absence of support systems, students and their teachers or mentors fall short when making the right decisions for furthering their chosen career paths. Therefore, in Chapter 7, the authors attempt to address the issue by proposing a system that can guide students to choose and focus on the right course(s) based on their personal preferences. For this purpose, a system has been envisaged that blends data mining and classification with big data. A methodology using the MapReduce framework and association rule mining is proposed in order to derive the right blend of courses for students to pursue to enhance their career prospects.
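As a rough illustration of the association rule mining idea mentioned for Chapter 7, the following minimal Python sketch computes the support and confidence of course combinations. It is not the authors' MapReduce and Apriori pipeline: the course names, the enrolment records, and the function names are all hypothetical, and the pair search is a brute-force stand-in for the candidate generation that Apriori performs.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) = support(A and C) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


def frequent_pairs(transactions, min_support):
    """Brute-force search for item pairs whose support meets the threshold."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for pair in combinations(items, 2):
        s = support(transactions, pair)
        if s >= min_support:
            result[pair] = s
    return result


# Hypothetical enrolment records: one set of courses per student.
records = [
    {"statistics", "machine_learning", "databases"},
    {"statistics", "machine_learning"},
    {"databases", "networks"},
    {"statistics", "databases"},
]

pairs = frequent_pairs(records, min_support=0.5)
rule_conf = confidence(records, {"statistics"}, {"machine_learning"})
```

Under these made-up records, a rule such as "statistics implies machine_learning" would be kept or discarded by comparing `rule_conf` with a chosen confidence threshold.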
Atmospheric air pollution is creating significant health problems that affect millions of people around the world. Chapter 8 analyzes the hypothesis about whether or not global green space variation is changing the global air quality. The authors perform a big data analysis with a dataset that contains more than 1 million (1,048,000) green space and air quality data points, considering 190 countries during the years 1990 to 2015. Air quality is measured by considering the particulate matter (PM) value. The analysis is carried out using multivariate graphs and a k-means clustering algorithm. The relative geographical changes of the tree areas, as well as the level of the air quality, were identified and the results indicated encouraging news.
Space technology and geotechnology, such as geographic information systems, play a vital role in the day-to-day activities of a society. In the initial days, the data collection was very rudimentary and primitive. The quality of the data collected was a subject of verification and the accuracy of the data was also questionable. With the advent of newer technology, the problems have been overcome. Using modern sophisticated systems, space science has been changed drastically. Implementing cutting-edge spaceborne sensors has made it possible to capture real-time data from space. Chapter 9 focuses on these aspects in detail.
Transportation plays an important role in our overall economy, conveying products and people through progressively mind-boggling, interconnected, and multidimensional transportation frameworks. But the complexities of present-day transportation can’t be managed by previous systems. The utilization of IDA frameworks and strategies, with compelling information gathering and data dispersion frameworks, provides the openings required to build the future intelligent transportation systems (ITSs). In Chapter 10, the authors exhibit the application of IDA in IoT-based ITS.
Chapter 11 aims to observe emerging patterns and trends by using big data analysis to enhance predictions of motor vehicle collisions using a dataset consisting of 17 attributes and 998,193 collisions in New York City. The data is extracted from the New York City Police Department (NYPD). The dataset has then been tested in three classification algorithms, which are k-nearest neighbor, random forest, and naive Bayes. The outputs are captured using the k-fold cross-validation method. These outputs are used to identify and compare classifier accuracy, and random forest node accuracy and processing time. Further, an analysis of raw data is performed describing the four different vehicle groups in order to detect significance within the recorded period. Finally, extreme cases of collision severity are identified using outlier analysis. The analysis demonstrates that out of three classifiers, random forest gives the best results.
Neurological disorders are the diseases that are related to the brain, nervous system, and the spinal cord of the human body. These disorders may affect the walking, speaking, learning, and moving capacity of human beings. Some of the major human neurological disorders are stroke, brain tumors, epilepsy, meningitis, Alzheimer’s, etc. Additionally, remarkable growth has been observed in the areas of disease diagnosis and health informatics. The critical human disorders related to lung, kidney, skin, and brain have been successfully diagnosed using different data mining and machine learning techniques. In Chapter 12, several neurological and psychological disorders are discussed. The role of different computing techniques in designing different biomedical applications is presented. In addition, the challenges and promising areas of innovation in designing a smart and intelligent neurological disorder diagnostic system using big data, internet of things, and emerging computing techniques are also highlighted.
Bug reports are one of the crucial software artifacts in open-source software. Issue tracking systems maintain enormous numbers of bug reports with several attributes, such as long description of bugs, threaded discussion comments, and bug meta-data, which includes Bug ID, priority, status, resolution, time, and others. In Chapter 13, bug reports of 20 open-source projects of the Apache Software Foundation are extracted using a tool named the Bug Report Collection System for trend analysis. As per the quantitative analysis of data, about 20% of open bugs are critical in nature, which directly impacts the functioning of the system. The presence of a large number of bugs of this kind can put systems into vulnerable positions and reduces the risk aversion capability. Thus, it is essential to resolve these issues on a high priority. The test lead can assign these issues to the most contributing developers of a project for quick closure of opened critical bugs. The comments are mined, which helps us identify the developers resolving the majority of bugs, which is beneficial for test leads of distinct projects. As per the collated data, the areas more prone to system failures are determined, such as input/output type error and logical code error.
Sentiments are the standard way by which people express their feelings. Sentiments are broadly classified as positive and negative. The problem occurs when the user expresses with words that are different than the actual feelings. This phenomenon is generally known to us as sarcasm, where people say something opposite the actual sentiments. Sarcasm detection is of great importance for the correct analysis of sentiments. Chapter 14 attempts to give an algorithm for successful detection of hyperbolic sarcasm and general sarcasm in a dataset of sarcastic posts that are collected from pages dedicated for sarcasm on social media sites such as Facebook, Pinterest, and Instagram. This chapter also shows the initial results of the algorithm and its evaluation.
Predictive analytics refers to forecasting the future probabilities by extracting information from existing datasets and determining patterns from predicted outcomes. Predictive analytics also includes what-if scenarios and risk assessment. In Chapter 15, an effort has been made to use principles of predictive modeling to analyze the authentic social network dataset, and results have been encouraging. The post-analysis of the results has been focused on exhibiting contact details, mobility pattern, and a number of degree of connections/minutes leading to identification of the linkage/bonding between the nodes in the social network.
Modern medicine has been confronted by a major challenge of achieving the promise and capacity of the tremendous expansion in medical datasets of all kinds. Medical databases accumulate a huge bulk of knowledge and data, which mandates a specialized tool to store and analyze the data and, as a result, to use the saved knowledge and data effectively. Information is extracted from data by using a domain’s background knowledge in the process of IDA. Various matters dealt with regard the use, definition, and impact of these processes, and they are tested for their optimization in application domains of medicine. The primary focus of Chapter 16 is on the methods and tools of IDA, with an aim to minimize the growing differences between data comprehension and data gathering.
Snoozing, or sleeping, is a physical phenomenon of the human life. When human snooze is disturbed, it generates many problems, such as mental disease, heart disease, etc. Total snooze is characterized by two stages, viz., rapid eye movement and nonrapid eye movement. Bruxism is a type of snooze disorder. The traditional method of the prognosis takes time and the result is in analog form. Chapter 17 proposes a method for easy prognosis of snooze bruxism.
Neurodegenerative diseases like Alzheimer’s and Parkinson’s impair the cognitive and motor abilities of the patient, along with memory loss and confusion. As handwriting involves proper functioning of the brain and motor control, it is affected. Alteration in handwriting is one of the first signs of Alzheimer’s disease. The handwriting gets shaky, due to loss of muscle control, confusion, and forgetfulness. The symptoms get progressively worse. It gets illegible and the phonological spelling mistakes become inevitable. In Chapter 18, the authors use a feature extraction technique to be used as a parameter for diagnosis. A variational autoencoder (VAE), a deep unsupervised learning technique, has been applied, which is used to compress the input data and then reconstruct it keeping the targeted output the same as the targeted input.
This edited volume on IDA gathers researchers, scientists, and practitioners interested in computational data analysis methods, aimed at narrowing the gap between extensive amounts of data stored in medical databases and the interpretation, understanding, and effective use of the stored data. The expected readers of this book are researchers, scientists, and practitioners interested in IDA, knowledge discovery, and decision support in databases, particularly those who are interested in using these technologies. This publication provides useful references for educational institutions, industry, academic researchers, professionals, developers, and practitioners to apply, evaluate, and reproduce the contributions to this book.
May 07, 2019
Deepak Gupta, New Delhi, India
Siddhartha Bhattacharyya, Bengaluru, India
Ashish Khanna, New Delhi, India
Kalpna Sagar, Uttar Pradesh, India
Intelligent Data Analysis: Black Box Versus White Box Modeling
Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma
Division of Information Technology, Netaji Subhas University of Technology, New Delhi, India
1.1 Introduction
In the midst of all of the societal challenges of today’s world, digital transformation is rapidly becoming a necessity. The number of internet users is growing at an unprecedented rate. New devices, sensors, and technologies are emerging every day. These factors have led to an exponential increase in the volume of data being generated. According to recent research [1], users of the internet generate 2.5 quintillion bytes of data per day.
1.1.1 Intelligent Data Analysis
Data is only as good as what you make of it. The sheer amount of data being generated calls for methods to leverage its power. With the proper tools and methodologies, data analysis can improve decision making, lower the risks, and unearth hidden insights. Intelligent data analysis (IDA) is concerned with effective analysis of data [2, 3].
The process of IDA consists of three main steps (see Figure 1.1):
1. Data collection and preparation: This step involves acquiring data and converting it into a format suitable for further analysis. This may involve storing the data as a table, taking care of empty or null values, etc.
2. Exploration: Before a thorough analysis can be performed on the data, certain characteristics are examined, like the number of data points, included variables, statistical features, etc. Data exploration allows analysts to get familiar with the dataset and create prospective hypotheses. Visualization is extensively used in this step. Various visualization techniques will be discussed in depth later in this chapter.
3. Analysis: Various machine learning and deep learning algorithms are applied at this step. Data analysts build models that try to find the best possible fit to the data points. These models can be classified as white box or black box models.
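The three steps can be sketched on a toy problem. The following is only a minimal illustration under stated assumptions: the records, the variable names, and the choice of a least-squares line as the "model" are hypothetical and are not taken from the chapter.

```python
import statistics

# Hypothetical raw records: (hours_studied, exam_score); None marks a missing value.
raw_records = [(1.0, 52.0), (2.0, 55.0), (None, 61.0), (3.0, 61.0),
               (4.0, None), (5.0, 68.0), (6.0, 71.0)]


def prepare(records):
    """Step 1: keep only complete rows (one simple way to handle null values)."""
    return [(x, t) for x, t in records if x is not None and t is not None]


def explore(rows):
    """Step 2: summarize the dataset before any modeling is attempted."""
    xs = [x for x, _ in rows]
    ts = [t for _, t in rows]
    return {
        "n_points": len(rows),
        "mean_x": statistics.mean(xs),
        "mean_t": statistics.mean(ts),
        "stdev_t": statistics.stdev(ts),
    }


def analyze(rows):
    """Step 3: fit t = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ts = [t for _, t in rows]
    mean_x, mean_t = statistics.mean(xs), statistics.mean(ts)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    slope = sxt / sxx
    intercept = mean_t - slope * mean_x
    return slope, intercept


rows = prepare(raw_records)
summary = explore(rows)
slope, intercept = analyze(rows)
print(summary)
print(f"fitted model: t = {slope:.2f} * x + {intercept:.2f}")
```

Dropping incomplete rows is only one way to treat null values; imputing them is another, and the better choice depends on the dataset.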
A more comprehensive introduction to data analysis can be found in prior pieces of literature [4–6].
Intelligent Data Analysis: From Data Gathering to Data Comprehension, First Edition. Edited by Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar. © 2020 John Wiley & Sons Ltd. Published 2020 by John Wiley & Sons Ltd.
1.1.2 Applications of IDA and Machine Learning
IDA and machine learning can be applied to a multitude of products and services, since these models have the ability to make fast, data-driven decisions at scale. We’re surrounded by live examples of machine learning in things we use in day-to-day life.
A primary example is web page ranking [7, 8]. Whenever we search for anything on a search engine, the results that we get are presented to us in the order of relevance. To achieve this, the search engine needs to “know” which pages are more relevant than others.
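One well-known signal a ranking system can use is link-based importance, as in PageRank. The sketch below is a simplified power-iteration version on a hypothetical four-page link graph; it is offered as an illustration of the idea and is not necessarily the approach described in [7, 8]. The page names, damping factor, and iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Assumes every link target also appears as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: each key links to the pages in its list.
web = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}
ranks = pagerank(web)
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered)
```

Link structure measures importance rather than relevance to a particular query, so a real search engine would combine a score like this with query-dependent signals.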
A related application is collaborative filtering [9, 10]. Collaborative filtering filters information based on recommendations of other people. It is based on the premise that people who agreed in their evaluation of certain items in the past are likely to agree again in the future.
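The premise can be illustrated with a small user-based sketch: a user's unknown rating is estimated from the ratings of other users, weighted by how closely they agreed in the past. The ratings table, the user and item names, and the use of cosine similarity are all assumptions chosen for illustration; [9, 10] may describe different formulations.

```python
import math

# Hypothetical ratings on a 1-5 scale: {user: {item: rating}}.
ratings = {
    "ana":  {"film_a": 5, "film_b": 4, "film_c": 1},
    "ben":  {"film_a": 5, "film_b": 5, "film_c": 1, "film_d": 5},
    "cara": {"film_a": 1, "film_b": 2, "film_c": 5, "film_d": 1},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum, weight_total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


prediction = predict("ana", "film_d")
print(prediction)
```

Because "ana" agreed with "ben" far more than with "cara" on the films all three rated, the estimate for "film_d" is pulled toward ben's rating.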
Another application is automatic translation of documents from one language to another. Manually doing this is an extremely arduous task and would take a significant amount of time.
IDA and machine learning models are also being used for many other tasks [11, 12] like object classification, named entity recognition, object localization, stock prices prediction, etc.
1.1.3 White Box Models Versus Black Box Models
IDA aims to analyze the data to create predictive models. Suppose that we’re given a dataset D(X, T), where X represents inputs and T represents target values (i.e., known correct values with respect to the input). The goal is to learn a function (or map) from inputs (X) to outputs (T). This is done by employing supervised machine learning algorithms [13]. A model refers to the artifact that is created by the training (or learning) process. Models are broadly categorized into two types:
1. White box models: The models whose predictions are easily explainable are called white box models. These models are typically simple, and their accuracy is often lower than that of more complex models, particularly on complex tasks. Examples include simple decision trees, linear regression, logistic regression, etc.
2. Black box models: The models whose predictions are difficult to interpret or explain are called black box models. They are difficult to interpret because of their complexity. Since they are complex models, their accuracy is usually high. Examples include large decision trees, random forests, neural networks, etc.
So, IDA and machine learning models suffer from an accuracy-explainability trade-off. However, with advances in IDA, the explainability gap in black box models is reducing.
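As a rough illustration of the white box end of this trade-off, the sketch below learns a one-level decision tree (a "decision stump") from a small dataset D(X, T). The data and function names are hypothetical. The point is that the entire learned model is a single threshold rule that can be read directly, which is what makes its predictions easy to explain.

```python
def fit_stump(X, T):
    """Learn a one-feature rule: predict 1 if x > threshold, else 0.

    Returns the threshold with the fewest training errors and that error count.
    """
    best_threshold, best_errors = None, len(X) + 1
    values = sorted(set(X))
    # Candidate thresholds are midpoints between consecutive distinct inputs.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        errors = sum((x > threshold) != bool(t) for x, t in zip(X, T))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def stump_predict(x, threshold):
    """Apply the learned rule to a new input."""
    return int(x > threshold)


# Hypothetical dataset D(X, T): inputs X and known binary targets T.
X = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
T = [0, 0, 0, 1, 1, 1, 1, 1]

threshold, training_errors = fit_stump(X, T)
print(f"learned rule: predict 1 if x > {threshold}, else 0")
print(f"training errors: {training_errors}")
```

A random forest or neural network trained on the same data would offer no single rule of this kind to inspect, which is the explainability gap referred to above.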