Accelerators for Convolutional Neural Networks

Arslan Munir
Kansas State University
USA

Joonho Kong
Kyungpook National University
South Korea

Mahmood Azhar Qureshi
Kansas State University
USA
Copyright © 2024 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
Library of Congress Cataloging-in-Publication Data Applied for:
Hardback ISBN: 9781394171880
Cover Design: Wiley
Cover Image: © Gorodenkoff/Shutterstock; Michael Traitov/Shutterstock
Set in 9.5/12.5pt STIX Two Text by Straive, Chennai, India
Arslan Munir dedicates this book to his wife Neda and his parents for their continuous support.

Joonho Kong dedicates this book to his wife Jiyeon, children Eunseo and Eunyu, and his parents for their continuous support.

Mahmood Azhar Qureshi dedicates this book to his parents, siblings, and his wife Kiran, all of whom provided continuous support throughout his academic and professional career.
Contents

About the Authors
Preface

Part I Overview

1 Introduction
1.1 History and Applications
1.2 Pitfalls of High-Accuracy DNNs/CNNs
1.2.1 Compute and Energy Bottleneck
1.2.2 Sparsity Considerations
1.3 Chapter Summary

2 Overview of Convolutional Neural Networks
2.1 Deep Neural Network Architecture
2.2 Convolutional Neural Network Architecture
2.2.1 Data Preparation
2.2.2 Building Blocks of CNNs
2.2.2.1 Convolutional Layers
2.2.2.2 Pooling Layers
2.2.2.3 Fully Connected Layers
2.2.3 Parameters of CNNs
2.2.4 Hyperparameters of CNNs
2.2.4.1 Hyperparameters Related to Network Structure
2.2.4.2 Hyperparameters Related to Training
2.2.4.3 Hyperparameter Tuning
2.3 Popular CNN Models
2.3.1 AlexNet
2.3.2 VGGNet
2.3.3 GoogleNet
2.3.4 SqueezeNet
2.3.5 Binary Neural Networks
2.3.6 EfficientNet
2.4 Popular CNN Datasets
2.4.1 MNIST Dataset
2.4.2 CIFAR
2.4.3 ImageNet
2.5 CNN Processing Hardware
2.5.1 Temporal Architectures
2.5.2 Spatial Architectures
2.5.3 Near-Memory Processing
2.6 Chapter Summary
Part II Compressive Coding for CNNs

3 Contemporary Advances in Compressive Coding for CNNs
3.1 Background of Compressive Coding
3.2 Compressive Coding for CNNs
3.3 Lossy Compression for CNNs
3.4 Lossless Compression for CNNs
3.5 Recent Advancements in Compressive Coding for CNNs
3.6 Chapter Summary

4 Lossless Input Feature Map Compression
4.1 Two-Step Input Feature Map Compression Technique
4.2 Evaluation
4.3 Chapter Summary

5 Arithmetic Coding and Decoding for 5-Bit CNN Weights
5.1 Architecture and Design Overview
5.2 Algorithm Overview
5.2.1 Weight Encoding Algorithm
5.3 Weight Decoding Algorithm
5.4 Encoding and Decoding Examples
5.4.1 Decoding Hardware
5.5 Evaluation Methodology
5.6 Evaluation Results
5.6.1 Compression Ratio and Memory Energy Consumption
5.6.2 Latency Overhead
5.6.3 Latency vs. Resource Usage Trade-Off
5.6.4 System-Level Energy Estimation
5.7 Chapter Summary
Part III Dense CNN Accelerators

6 Contemporary Dense CNN Accelerators
6.1 Background on Dense CNN Accelerators
6.2 Representation of the CNN Weights and Feature Maps in Dense Format
6.3 Popular Architectures for Dense CNN Accelerators
6.4 Recent Advancements in Dense CNN Accelerators
6.5 Chapter Summary

7 iMAC: Image-to-Column and General Matrix Multiplication-Based Dense CNN Accelerator
7.1 Background and Motivation
7.2 Architecture
7.3 Implementation
7.4 Chapter Summary

8 NeuroMAX: A Dense CNN Accelerator
8.1 Related Work
8.2 Log Mapping
8.3 Hardware Architecture
8.3.1 Top-Level
8.3.2 PE Matrix
8.4 Data Flow and Processing
8.4.1 3 × 3 Convolution
8.4.2 1 × 1 Convolution
8.4.3 Higher-Order Convolutions
8.5 Implementation and Results
8.6 Chapter Summary
Part IV Sparse CNN Accelerators

9 Contemporary Sparse CNN Accelerators
9.1 Background of Sparsity in CNN Models
9.2 Background of Sparse CNN Accelerators
9.3 Recent Advancements in Sparse CNN Accelerators
9.4 Chapter Summary

10 CNN Accelerator for In Situ Decompression and Convolution of Sparse Input Feature Maps
10.1 Overview
10.2 Hardware Design Overview
10.3 Design Optimization Techniques Utilized in the Hardware Accelerator
10.4 FPGA Implementation
10.5 Evaluation Results
10.5.1 Performance and Energy
10.5.2 Comparison with State-of-the-Art Hardware Accelerator Implementations
10.6 Chapter Summary

11 Sparse-PE: A Sparse CNN Accelerator
11.1 Related Work
11.2 Sparse-PE
11.2.1 Sparse Binary Mask
11.2.2 Selection
11.2.3 Computation
11.2.4 Accumulation
11.2.5 Output Encoding
11.3 Implementation and Results
11.3.1 Cycle-Accurate Simulator
11.3.1.1 Performance with Varying Sparsity
11.3.1.2 Comparison Against Past Approaches
11.3.2 RTL Implementation
11.4 Chapter Summary

12 Phantom: A High-Performance Computational Core for Sparse CNNs
12.1 Related Work
12.2 Phantom
12.2.1 Sparse Mask Representation
12.2.2 Core Architecture
12.2.3 Lookahead Masking
12.2.4 Top-Down Selector
12.2.4.1 In-Order Selection
12.2.4.2 Out-of-Order Selection
12.2.5 Thread Mapper
12.2.6 Compute Engine
12.2.7 Output Buffer
12.2.8 Output Encoding
12.3 Phantom-2D
12.3.1 R × C Compute Matrix
12.3.2 Load Balancing
12.3.3 Regular/Depthwise Convolution
12.3.3.1 Intercore Balancing
12.3.4 Pointwise Convolution
12.3.5 FC Layers
12.3.6 Intracore Balancing
12.4 Experiments and Results
12.4.1 Evaluation Methodology
12.4.1.1 Cycle-Accurate Simulator
12.4.1.2 Simulated Models
12.4.2 Results
12.4.2.1 TDS Variants Comparison
12.4.2.2 Impact of Load Balancing
12.4.2.3 Sensitivity to Sparsity and Lf
12.4.2.4 Comparison Against Past Approaches
12.4.2.5 RTL Synthesis Results
12.5 Chapter Summary
Part V HW/SW Co-Design and Co-Scheduling for CNN Acceleration

13 State-of-the-Art in HW/SW Co-Design and Co-Scheduling for CNN Acceleration
13.1 HW/SW Co-Design
13.1.1 Case Study: Cognitive IoT
13.1.2 Recent Advancements in HW/SW Co-Design
13.2 HW/SW Co-Scheduling
13.2.1 Recent Advancements in HW/SW Co-Scheduling
13.3 Chapter Summary

14 Hardware/Software Co-Design for CNN Acceleration
14.1 Background of iMAC Accelerator
14.2 Software Partition for iMAC Accelerator
14.2.1 Channel Partition and Input/Weight Allocation to Hardware Accelerator
14.2.2 Exploiting Parallelism Within Convolution Layer Operations
14.3 Experimental Evaluations
14.4 Chapter Summary

15 CPU-Accelerator Co-Scheduling for CNN Acceleration
15.1 Background and Preliminaries
15.1.1 Convolutional Neural Networks
15.1.2 Baseline System Architecture
15.2 CNN Acceleration with CPU-Accelerator Co-Scheduling
15.2.1 Overview
15.2.2 Linear Regression-Based Latency Model
15.2.2.1 Accelerator Latency Model
15.2.2.2 CPU Latency Model
15.2.3 Channel Distribution
15.2.4 Prototype Implementation
15.3 Experimental Results
15.3.1 Latency Model Accuracy
15.3.2 Performance
15.3.3 Energy
15.3.4 Case Study: Tiny Darknet CNN Inferences
15.4 Chapter Summary

16 Conclusions

References
Index
About the Authors

Arslan Munir is currently an Associate Professor in the Department of Computer Science at Kansas State University. He was a postdoctoral research associate in the Electrical and Computer Engineering (ECE) Department at Rice University, Houston, Texas, USA, from May 2012 to June 2014. He received his MASc in ECE from the University of British Columbia (UBC), Vancouver, Canada, in 2007 and his PhD in ECE from the University of Florida (UF), Gainesville, Florida, USA, in 2012. From 2007 to 2008, he worked as a software development engineer at Mentor Graphics Corporation in the Embedded Systems Division.

Munir's current research interests include embedded and cyber-physical systems, artificial intelligence, deep learning hardware, computer vision, secure and trustworthy systems, parallel computing, and reconfigurable computing. Munir has received many academic awards, including the doctoral fellowship from the Natural Sciences and Engineering Research Council (NSERC) of Canada. He earned gold medals for best performance in electrical engineering, and gold medals and academic roll of honor for securing rank one in pre-engineering provincial examinations (out of approximately 300,000 candidates). He is a senior member of IEEE.

Joonho Kong is currently an Associate Professor with the School of Electronics Engineering, Kyungpook National University. He received the BS degree in computer science and the MS and PhD degrees in computer science and engineering from Korea University, in 2007, 2009, and 2011, respectively. He worked as a postdoctoral research associate with the Department of Electrical and Computer Engineering, Rice University, from 2012 to 2014. Before joining Kyungpook National University, he also worked as a Senior Engineer at Samsung Electronics, from 2014 to 2015. His research interests include computer architecture, heterogeneous computing, embedded systems, hardware/software co-design, AI/ML accelerators, and hardware security. He is a member of IEEE.

Mahmood Azhar Qureshi is currently a Senior Design Engineer at Intel Corporation. He received his PhD in Computer Science from Kansas State University, Manhattan, Kansas, in 2021, where he also worked as a research assistant from 2018 to 2021. He received his MS in electrical engineering from the University of Engineering and Technology (UET), Taxila, Pakistan, in 2018 and his BE in electrical engineering from the National University of Sciences and Technology (NUST), Pakistan, in 2013. From 2014 to 2018, he worked as a Senior RTL Design Engineer at the Center for Advanced Research in Engineering (CARE) Pvt. Ltd, Islamabad, Pakistan. During the summer of 2020, he interned at MathWorks, USA, where he was actively involved in adding new features to Matlab, a tool used globally in industry as well as academia. During fall 2020, he interned at Tesla, working on the failure analysis of the infotainment hardware for the Tesla Model 3 and Model Y global fleet.
Preface
Convolutional neural networks (CNNs) have gained tremendous significance in the domain of artificial intelligence (AI) because of their use in a variety of applications related to visual imagery analysis. There has been a drastic increase in the accuracy of CNNs in recent years, which has helped CNNs make their way into real-world applications. This increase in accuracy, however, translates into a sizable model and high computational requirements, which make the deployment of these CNNs in resource-limited computing platforms a challenging endeavor. Thus, embedding CNN inference into various real-world applications requires the design of high-performance, area-efficient, and energy-efficient accelerator architectures. This book targets the design of accelerators for CNNs.
This book is organized into five parts: overview, compressive coding for CNNs, dense CNN accelerators, sparse CNN accelerators, and HW/SW co-design and co-scheduling for CNN acceleration. The first part of the book provides an overview of CNNs along with the composition of different contemporary CNN models, and then discusses some of the architectural and algorithmic techniques for efficient processing of CNN models.

The second part of the book discusses compressive coding for CNNs to compress CNN weights and feature maps, including Huffman coding for lossless compression of CNN weights and feature maps. The book then elucidates a two-step lossless input feature map compression method, followed by a discussion of an arithmetic coding and decoding-based lossless weight compression method.

The third part of the book focuses on the design of dense CNN accelerators. The book provides a discussion of contemporary dense CNN accelerators. It then presents the iMAC dense CNN accelerator, which combines image-to-column and general matrix multiplication hardware acceleration, followed by a discussion of another dense CNN accelerator that utilizes log-based processing elements and a 2D data flow to maximize data reuse and hardware utilization.

The fourth part of the book targets sparse CNN accelerators. The book discusses contemporary sparse CNN accelerators that consider sparsity in weights and activation maps (i.e., many weights and activations in CNNs are zero and result in ineffectual computations) to deliver high effective throughput. The book then presents a sparse CNN accelerator that performs in situ decompression and convolution of sparse input feature maps. Afterwards, the book discusses a sparse CNN accelerator which has the capability to actively skip a huge number of ineffectual computations (i.e., computations involving zero weights and/or activations), while only favoring effectual computations (nonzero weights and nonzero activations), to drastically improve hardware utilization. The book then presents another sparse CNN accelerator that uses a sparse binary mask representation to actively look ahead into sparse computations and dynamically schedule its computational threads to maximize thread utilization and throughput.

The fifth part of the book targets hardware/software co-design and co-scheduling for CNN acceleration, which can lead to better optimization and utilization of the available hardware resources for CNN acceleration. The book summarizes recent works on hardware/software co-design and scheduling. It then presents a technique that utilizes software, algorithm, and hardware co-design to reduce the response time of CNN inferences. Afterwards, the book discusses a CPU-accelerator co-scheduling technique, which co-utilizes the CPU and CNN accelerators to expedite CNN inference. The book also provides directions for future research and development for CNN accelerators.
This is the first book on the subject of accelerators for CNNs that introduces readers to advances and state-of-the-art research in the design of CNN accelerators. This book can serve as a good reference for students, researchers, and practitioners working in the areas of hardware design, computer architecture, and AI acceleration.
January 24, 2023
Manhattan, KS, USA
Part I Overview
1 Introduction
Deep neural networks (DNNs) have enabled the deployment of artificial intelligence (AI) in many modern applications including autonomous driving [1], image recognition [2], and speech processing [3]. In many applications, DNNs have achieved close to human-level accuracy and, in some, they have exceeded human accuracy [4]. This high accuracy comes from a DNN's unique ability to automatically extract high-level features from a huge quantity of training data using statistical learning and improvement over time. This learning over time provides a DNN with an effective representation of the input space. This is quite different from the earlier approaches where specific features were hand-crafted by domain experts and were subsequently used for feature extraction.
Convolutional neural networks (CNNs) are a type of DNN most commonly used for computer vision tasks. Among the different types of DNNs, such as multilayer perceptrons (MLPs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, radial basis function networks (RBFNs), generative adversarial networks (GANs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), and autoencoders, CNNs are the most commonly used. The invention of CNNs has revolutionized the field of computer vision and has enabled many applications of computer vision to go mainstream. CNNs have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, object detection, activity recognition, natural language processing, brain–computer interfaces, and financial time-series prediction.
DNN/CNN processing is usually carried out in two stages, training and inference, with both of them having their own computational needs. Training is the process where a DNN model is trained using a large application-specific data set. The training time is dependent on the model size and the target accuracy requirements. For high-accuracy applications like autonomous driving, training a DNN can take weeks and is usually performed on a cloud. Inference, on the other hand, can be performed either on the cloud or the edge device (mobile device, Internet of Things (IoT), autonomous vehicle, etc.). Nowadays, in many applications, it is advantageous to perform the inference process on the edge devices, as shown in Figure 1.1. For example, in cell phones, it is desirable to perform image and video processing on the device itself rather than sending the data over to the cloud for processing. This methodology reduces the communication cost and the latency involved with the data transmission and reception. It also eliminates the risk of losing important device features should there be a network disruption or loss of connectivity. Another motivation for doing inference on the device is the ever-increasing security risk involved with sending personalized data, including images and videos, over to the cloud servers for processing. Autonomous driving systems which require visual data need to deploy solutions to perform inference locally to avoid latency and security issues, both of which can result in a catastrophe should an undesirable event occur. Performing DNN/CNN inference on the edge presents its own set of challenges. This stems from the fact that the embedded platforms running on the edge devices have stringent cost limitations which limit their compute capabilities. Running compute- and memory-intensive DNN/CNN inference in these devices in an efficient manner becomes a matter of prime importance.

Figure 1.1 DNN/CNN processing methodology: a DNN model is trained on the cloud using a training dataset, and the trained model is deployed on the edge device, which performs inference on sensor-captured real-world data. Source: (b) Daughter#3 Cecil/Wikimedia Commons/CC BY-SA 2.0.
1.1 History and Applications
Neural nets have been around since the 1940s; however, the first practically applicable neural network, referred to as LeNet [5], was proposed in 1989. This neural network was designed to solve the problem of digit recognition in hand-written numeric digits. It paved the way for the development of neural networks responsible for various applications related to digit recognition, such as automated teller machines (ATMs), optical character recognition (OCR), automatic number plate recognition, and traffic sign recognition. The slow growth and little to no adoption of neural networks in the early days was mainly due to the massive computational requirements involved with their processing, which limited their study to theoretical concepts.
Over the past decade, there has been an exponential growth in research on DNNs, with many new high-accuracy neural networks being deployed for various applications. This has only been possible because of two factors. The first factor is the advancements in the processing power of semiconductor devices and technological breakthroughs in computer architecture. Nowadays, computers have significantly higher computing capability. This enables the processing of a neural network within a reasonable time frame, something that was not achievable in the early days. The second factor is the availability of a large amount of training data. As neural networks learn over time, providing huge amounts of training data enables better accuracy. For example, Meta (parent company of Facebook) receives close to a billion user images per day, whereas YouTube has 300 hours of video uploaded every minute [6]. This enables these service providers to train their neural networks for targeted advertising campaigns, bringing in billions of dollars of advertising revenue. Apart from their use in social media platforms, DNNs are also making a huge impact in many other domains. Some of these areas include:
● Speech Processing: Speech processing algorithms have improved significantly in the past few years. Nowadays, many applications have been developed that use DNNs to perform real-time speech recognition with unprecedented levels of accuracy [3, 7–9]. Many technology companies are also using DNNs to perform language translation used in a wide variety of applications. Google, for example, uses Google's neural machine translation system (GNMT) [10], which uses an LSTM-based seq2seq model, for their language translation applications.
● Autonomous Driving: Autonomous driving has been one of the biggest technological breakthroughs in the auto industry since the invention of the internal combustion engine. It is not a coincidence that the self-driving boom came at the same time when high-accuracy CNNs became increasingly popular. Companies like Tesla [11] and Waymo [12] are using various types of self-driving technology, including visual feeds and Lidar, for their self-driving solutions. One thing which is common in all these solutions is the use of CNNs for visual perception of the road conditions, which is the main back-end technology used in advanced driver assistance systems (ADAS).
● Medical AI: Another crucial area where DNNs/CNNs have become increasingly useful is medicine. Nowadays, doctors can use AI-assisted medical imagery to perform various surgeries. AI systems use DNNs in genomics to gather insights about genetic disorders like autism [13, 14]. DNNs/CNNs are also useful in the detection of various types of cancers like skin and brain cancer [15, 16].
● Security: The advent of AI has challenged many traditional security approaches that were previously deemed sufficient. The rollout of 5G technology has caused a massive surge of IoT-based deployments which traditional security approaches are not able to keep up with. Physical unclonability approaches [17–21] were introduced to protect this massive deployment of IoTs against security attacks with minimum cost overheads. These approaches, however, were also unsuccessful in preventing AI-assisted attacks using DNNs [22, 23]. Researchers have now been forced to upgrade security threat models to incorporate AI-based attacks [24, 25]. Because of a massive increase in AI-assisted cyber-attacks on clouds and data centers, companies have realized that the best way of defeating offensive AI attacks is by incorporating AI-based counterattacks [26, 27].
Overall, the use of DNNs, in particular CNNs, in various applications has seen exponential growth over the past decade, and this trend continues to rise. The massive increase in CNN deployments on edge devices requires the development of efficient processing architectures to keep up with the computational requirements for successful CNN inference.
1.2 Pitfalls of High-Accuracy DNNs/CNNs
This section discusses some of the pitfalls of high-accuracy DNN/CNN models, focusing on compute and energy bottlenecks and the effect of the sparsity of high-accuracy models on throughput and hardware utilization.
1.2.1 Compute and Energy Bottleneck
CNNs are composed of multiple convolution layers (CONV) which help in extracting low-, mid-, and high-level input features for better accuracy. Although CNNs are primarily used in applications related to image and video processing, they are also used in speech processing [3, 7], game play [28], and robotics [29] applications. We will further discuss the basics of CNNs in Chapter 2. In this section, we explore some of the bottlenecks when it comes to implementing high-accuracy CNN inference engines in embedded mobile devices.

Table 1.1 Popular CNN models.

CNN model        Layers   Top-1 acc. (%)   Top-5 acc. (%)   Parameters   MACs
AlexNet [30]     8        63.3             84.6             62M          666M
VGG-16 [31]      16       74.3             91.9             138M         15.3B
GoogleNet [35]   22       68.9             88               6.8M         1.5B
MobileNet [35]   28       70.9             89.9             4.2M         569M
ResNet-50 [32]   50       75.3             92.2             25.5M        3.9B
The development of high-accuracy CNN models [30–34] in recent years has strengthened the notion of employing DNNs in various AI applications. The classification accuracy of CNNs for the ImageNet challenge [2] has improved considerably from 63.3% in 2012 (AlexNet [30]) to a staggering 87.3% (EfficientNetV2 [4] in 2021). This high jump in accuracy comes with high compute and energy costs for CNN inference. Table 1.1 shows some of the most commonly used CNN models. The models are trained using the ImageNet dataset [2], and the top-1 and top-5 classification accuracies are also given. We note that top-1 accuracy is the conventional accuracy, which means that the model answer (i.e., the one predicted by the model with the highest probability) must be exactly the expected answer. Top-5 accuracy means that any of the five highest-probability answers predicted by the model must match the expected answer. It can be seen from Table 1.1 that the addition of more layers results in better accuracy. This addition, however, also corresponds to a greater number of model parameters, requiring more memory and storage. It also results in more multiply-accumulate (MAC) operations, causing an increase in computational complexity and resource requirements, which, in turn, affects the performance of edge devices.
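To make the top-1 and top-5 definitions above concrete, the following minimal sketch (with illustrative toy values, not taken from the book) counts a sample as correct when its true label appears among the k highest-scoring predictions:

```python
import numpy as np

def top_k_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label appears among the k
    highest-scoring class predictions."""
    top_k = np.argsort(logits, axis=1)[:, -k:]  # k best classes per sample
    hits = [label in row for label, row in zip(labels, top_k)]
    return sum(hits) / len(labels)

# Toy batch: 3 samples, 4 classes.
logits = np.array([[0.1, 0.2, 0.6, 0.1],   # best guess: class 2
                   [0.5, 0.3, 0.1, 0.1],   # best guess: class 0
                   [0.3, 0.4, 0.2, 0.1]])  # best guess: class 1
labels = [2, 1, 0]

print(top_k_accuracy(logits, labels, k=1))  # only sample 0 is a top-1 hit
print(top_k_accuracy(logits, labels, k=2))  # all three are top-2 hits
```

With k = 1 only the first sample is scored correct (accuracy 1/3); with k = 2 all three are (accuracy 1.0), which is why top-5 figures in Table 1.1 are always higher than top-1.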
Even though some efforts have been made to reduce the size of the high-accuracy models, they still require massive amounts of computations over a series of network layers to perform a particular inference task (classification, segmentation, etc.). This tremendous number of computations (typically in tens of millions) presents a huge challenge for the neural network accelerators (NNAs) running the CNN inference. NNAs are specialized hardware blocks inside a computer system (e.g., mobile devices and cloud servers) that speed up the computations of the CNN inference process to maintain the real-time requirements of the system and improve system throughput. Apart from the massive computational requirements, the addition of more layers for higher accuracy drastically increases the CNN model size. This prevents the CNN model from being stored in the limited on-chip static random access memory (SRAM) of the edge device and, therefore, requires off-chip dynamic random access memory (DRAM), which presents a high DRAM access energy cost.

Figure 1.2 Energy cost (relative to an 8-bit integer Add operation) shown on a log10 scale for a 45 nm process technology, comparing 32-bit DRAM and SRAM reads, 32/16-bit FP multiply and add, 32/8-bit INT multiply, and 32/16-bit INT add. Source: Adapted from [6, 36].
To put this in perspective, the energy cost per fetch for 32-bit coefficients in an off-chip low-power double data rate 2 (LPDDR2) DRAM is about 640 pJ, which is about 6400× the energy cost of a 32-bit integer ADD operation [36]. The bigger the model is, the more memory referencing is performed to access the model data, which in turn expends more energy. Figure 1.2 shows the energy cost of various compute and memory operations relative to an 8-bit integer add (8b INT Add) operation. It can be seen that the DRAM read operation dominates the energy graph, with the 32-bit DRAM read consuming greater than 4 orders of magnitude higher energy than the 8b INT Add. As a consequence, the energy cost from just the DRAM accesses would be well beyond the limitations of an embedded mobile device with limited battery life. Therefore, in addition to accelerating the compute operations, the NNA also needs to minimize the off-chip memory transactions for decreasing the overall energy consumption.
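The scale of this cost is easy to sketch with the ~640 pJ per 32-bit DRAM fetch figure quoted above. The helper below and its single-pass traffic model are illustrative simplifications (they ignore activations, reuse, and caching), not a measurement from the text:

```python
DRAM_FETCH_PJ = 640.0  # ~energy per 32-bit off-chip LPDDR2 DRAM read (from the text)

def dram_energy_mj(num_params, bits_per_param=32):
    """Energy (in mJ) to stream a model's parameters from DRAM once."""
    fetches = num_params * bits_per_param / 32  # number of 32-bit reads needed
    return fetches * DRAM_FETCH_PJ * 1e-9       # pJ -> mJ

# AlexNet-sized model (62M parameters, Table 1.1): roughly 40 mJ per full
# pass over the weights, before counting any compute energy at all.
full_precision = dram_energy_mj(62e6)
quantized_8bit = dram_energy_mj(62e6, bits_per_param=8)
print(full_precision, quantized_8bit)  # 8-bit storage cuts weight traffic 4x
```

Even this lower bound on per-inference energy makes clear why reducing off-chip traffic, not just MAC count, is a first-order design goal for NNAs.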
Many algorithm-level techniques have been developed to minimize the computational requirements of a CNN without incurring a loss in accuracy. Since the main compute bottleneck in CNN inference is the CONV operation, MobileNets [33, 34] were developed to reduce the total number of CONV operations. These CNNs drastically reduce the total number of parameters and MAC operations by breaking down the standard 2D convolution into depthwise separable and pointwise convolutions. The depthwise separable and pointwise convolutions result in 8× to 9× reduction in total computations compared to regular CONV operations, with a slight decrease in accuracy. They also eliminate varying filter sizes and, instead, use 3 × 3 and 1 × 1 filters for performing convolution operations. This makes them ideal for embedded mobile devices because of their relatively low memory footprint and lower total MAC operations.
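The 8× to 9× figure follows directly from MAC counting, as the quick sketch below shows (the layer dimensions are illustrative MobileNet-like values, not taken from the text):

```python
def standard_conv_macs(h, w, cin, cout, k):
    """MACs for a standard k x k convolution, stride 1, 'same' padding."""
    return h * w * cin * cout * k * k

def depthwise_separable_macs(h, w, cin, cout, k):
    """MACs for a depthwise k x k stage plus a pointwise 1 x 1 stage."""
    depthwise = h * w * cin * k * k       # one k x k filter per channel
    pointwise = h * w * cin * cout        # 1 x 1 across-channel mixing
    return depthwise + pointwise

# Illustrative layer: 112 x 112 feature map, 64 -> 128 channels, 3 x 3 kernel.
std = standard_conv_macs(112, 112, 64, 128, 3)
sep = depthwise_separable_macs(112, 112, 64, 128, 3)
print(std / sep)  # ~8.4x fewer MACs, matching the 8-9x range above
```

In general the ratio is k²·cout / (k² + cout), so for a 3 × 3 kernel it approaches 9× as the output channel count grows.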
A widely used approach for decreasing the memory bottleneck is the reduction in the precision of both weights and activations using various quantization strategies [37–39]. This again does not result in a significant loss in accuracy and reduces the model size by a considerable amount. Hardware implementations like Envision [40], UNPU [41], and Stripes [42] show how reduced bit precision, and quantization, translate into better savings in energy.
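As a minimal sketch of one such strategy, the code below performs symmetric per-tensor 8-bit quantization; this particular scheme is an illustrative stand-in for the cited methods [37–39], not a description of any one of them:

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights to int8 codes plus one float scale factor."""
    scale = np.abs(weights).max() / 127.0
    codes = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

w = np.array([0.50, -0.25, 0.10, -0.05], dtype=np.float32)
codes, scale = quantize_int8(w)
w_hat = dequantize(codes, scale)
print(codes)                    # int8 codes: 4x smaller than float32 storage
print(np.abs(w - w_hat).max())  # rounding error is bounded by scale / 2
```

Storage (and DRAM traffic) shrinks 4× versus float32, while the worst-case reconstruction error stays within half a quantization step.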
1.2.2 Sparsity Considerations
Nonlinear activation functions [6], in addition to deep layers, are one of the key characteristics that improve the accuracy of a CNN model. Typically, nonlinearity is added by incorporating activation functions, the most common being the rectified linear unit (ReLU) [6]. The ReLU converts all negative values in a feature map to zeros. Since the output of one layer is the input to the next layer, many of the computations within a layer involve multiplication with zeros. These feature maps containing zeros are referred to as one-sided sparse feature maps. The multiplications resulting from this one-sided sparsity waste compute cycles and decrease the effective throughput and hardware utilization, thus reducing the performance of the accelerator. It also results in high energy costs, as the transfer of zeros to/from off-chip memory is wasted memory access. In order to reduce the computational and memory access volume, previous works [43–45] have exploited this one-sided sparsity and displayed some performance improvements. Exacerbating the issue of wasted compute cycles and memory accesses, two-sided sparsity is introduced in CNNs, often by pruning techniques, when, in addition to the feature maps, the weight data also consists of zeros. Designing a CNN accelerator that can overcome the wasted compute cycles and memory access issues of one-sided and two-sided sparsities is quite challenging.
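A small sketch of this one-sided sparsity and the work it wastes (the data is random, so the exact sparsity level is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-activation feature map: roughly half the values are negative.
fmap = rng.standard_normal((8, 8))
relu_out = np.maximum(fmap, 0.0)          # ReLU zeroes the negatives

sparsity = float(np.mean(relu_out == 0.0))
print(f"one-sided sparsity after ReLU: {sparsity:.0%}")

# A dense PE performs one multiply per activation; only the nonzero
# activations are effectual, the rest are wasted compute cycles.
acts = relu_out.ravel()
total = acts.size
effectual = int(np.count_nonzero(acts))
print(f"effectual multiplies: {effectual}/{total}")
```

Roughly half the multiplies here contribute nothing to the output, which is exactly the headroom that sparsity-aware accelerators try to reclaim.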
In recent years, many pruning techniques have been developed for the compression of DNN models [46–49]. Han et al. [46] iteratively pruned the connections based on a parameter threshold and performed retraining to retain accuracy. This type of pruning is referred to as unstructured pruning. It arbitrarily removes weight connections in a DNN/CNN but does little to improve acceleration on temporal architectures like central processing units (CPUs) and graphics processing units (GPUs), which rely on accelerating matrix multiplications. Another form of pruning, referred to as structured pruning [50, 51], reduces the size of the weight matrices and maintains a full matrix. This makes it possible to simplify the NNA design since the sparsity patterns are predictable, therefore enabling better hardware support for operation scheduling.
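The contrast between the two styles can be sketched on a toy weight matrix; the magnitude threshold and row-norm criterion below are illustrative stand-ins for the cited methods, not their actual heuristics:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 8))   # toy layer: 4 filters, 8 weights each

# Unstructured pruning: zero any individual weight below a magnitude
# threshold; the matrix shape is unchanged but becomes sparse.
threshold = 0.8
unstructured = np.where(np.abs(w) >= threshold, w, 0.0)

# Structured pruning: drop whole rows (filters) with the smallest L2
# norms; the result is a smaller but still fully dense matrix.
norms = np.linalg.norm(w, axis=1)
structured = w[norms >= np.median(norms)]

print(f"unstructured sparsity: {np.mean(unstructured == 0):.0%}")
print(f"structured: {w.shape} -> {structured.shape}")
```

The unstructured result needs sparse-aware hardware to pay off, while the structured result remains an ordinary dense matrix that GPUs and simple NNAs can process directly.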
Both unstructured and structured pruning strategies, as described above, result in two-sided sparsity (i.e., sparsity in both weights and activations), which leads to approximately 9× model reduction for AlexNet and 13× reduction for VGG-16. The pruning strategies also result in 4–9× effective compute reduction (depending on the model). These gains seem very promising; however, designing an accelerator architecture to leverage them is quite challenging because of the following reasons:
● Data Access Inconsistency: Computation gating is one of the most common ways by which sparsity is generally exploited. Whenever a zero in the activation or the weight data is read, no operation is performed. This results in energy savings but has no impact on the throughput because of the wasted compute cycle. Complex read logic needs to be implemented to discard the zeros and, instead, perform effective computations on nonzero data. Some previous works [52, 53] use sparse compression formats like compressed sparse column (CSC) or compressed sparse row (CSR) to represent sparse data. These formats have variable lengths and make looking ahead difficult if both the weight and the activation sparsity are being considered. Beyond that, developing the complex control and read logic to process these formats can be quite challenging.
● Low Utilization of the Processing Element (PE) Array: Convolution operations for CNN inference are usually performed using an array of two-dimensional PEs in a CNN accelerator. Different data flows (input stationary, output stationary, weight stationary, etc.) have been proposed that efficiently map the weight data and the activation data onto the PE array to maximize the throughput [6, 54]. Sparsity introduces inconsistency in the scheduling of data, thereby reducing hardware utilization. The subset of PEs provided with more sparse data have idle times while those provided with less sparse (or denser) data are fully active. This bounds the throughput of the accelerator to the most active PEs and, therefore, leads to the underutilization of the PE array.
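The variable-length issue noted above is visible in a minimal CSR sketch (illustrative only, not any particular accelerator's format): each row's run of nonzeros has a different length, so a consumer cannot know where row i's data starts without first consulting the row pointers.

```python
import numpy as np

def to_csr(dense):
    """Build the three CSR arrays: nonzero values, their column indices,
    and row pointers marking where each row's run of values begins."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(int(v))
                col_idx.append(j)
        row_ptr.append(len(values))  # row i spans values[row_ptr[i]:row_ptr[i+1]]
    return values, col_idx, row_ptr

m = np.array([[5, 0, 0, 2],
              [0, 0, 3, 0],
              [0, 1, 0, 0]])
values, col_idx, row_ptr = to_csr(m)
print(values)   # [5, 2, 3, 1] -- zeros are simply not stored
print(col_idx)  # [0, 3, 2, 1]
print(row_ptr)  # [0, 2, 3, 4] -- rows have variable lengths (2, 1, 1)
```

Intersecting two such streams (one for weights, one for activations) to find pairs that are both nonzero is precisely the look-ahead and control-logic burden described above.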
Considering the abovementioned issues, many accelerators have been proposed in the past that attempt to strike a balance between hardware resource complexity and performance improvements. The CNN accelerators that exploit sparsity in CNN models are covered in detail in Part IV of this book.
1.3 Chapter Summary
This chapter discussed the history and applications of DNNs, focusing on CNNs. The chapter also highlighted the compute and energy bottlenecks as well as the effect of sparsity in high-accuracy CNN models on the throughput and hardware utilization of edge devices.