ADVANCES IN COMPUTERS

Data Prefetching Techniques in Computer Systems

Edited by

PEJMAN LOTFI-KAMRAN
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

HAMID SARBAZI-AZAD
Sharif University of Technology, and Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Academic Press is an imprint of Elsevier

50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
525 B Street, Suite 1650, San Diego, CA 92101, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
125 London Wall, London, EC2Y 5AS, United Kingdom

First edition 2022

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-323-85119-0
ISSN: 0065-2458

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Zoe Kruze

Developmental Editor: Cindy Angelita Gardose
Production Project Manager: James Selvam
Cover Designer: Greg Harris
Typeset by STRAIVE, India

Contributors
Preface
Table of abbreviations

1. Introduction to data prefetching
Pejman Lotfi-Kamran and Hamid Sarbazi-Azad
    1. Introduction
    2. Background
    3. A preliminary hardware data prefetcher
    4. Nonhardware data prefetching
    5. Conclusion
    References
    Further reading
    About the authors

2. Spatial prefetching
Pejman Lotfi-Kamran and Hamid Sarbazi-Azad
    1. Introduction
    2. Spatial memory streaming (SMS)
    3. Variable length delta prefetcher (VLDP)
    4. Summary
    References
    About the authors

3. Temporal prefetching
Pejman Lotfi-Kamran and Hamid Sarbazi-Azad
    1. Introduction
    2. Sampled temporal memory streaming (STMS)
    3. Irregular stream buffer (ISB)
    4. Summary
    References
    About the authors

4. Beyond spatial or temporal prefetching
Pejman Lotfi-Kamran and Hamid Sarbazi-Azad
    1. Introduction
    2. Spatiotemporal memory streaming (STeMS)
    3. Summary
    References
    About the authors

5. State-of-the-art data prefetchers
Mehran Shakerinava, Fatemeh Golshan, Ali Ansari, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad
    1. DOMINO temporal data prefetcher
    2. BINGO spatial data prefetcher
    3. Multi-lookahead offset prefetcher
    4. Runahead metadata
    5. Summary
    References
    About the authors

6. Evaluation of data prefetchers
Mehran Shakerinava, Fatemeh Golshan, Ali Ansari, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad
    1. Introduction
    2. Spatial prefetching
    3. Temporal prefetching
    4. Spatio-temporal prefetching
    5. Offset prefetching
    6. Multi-degree prefetching with pairwise-correlating prefetchers
    7. Summary
    References
    About the authors

Contributors

Ali Ansari
Sharif University of Technology, Tehran, Iran

Fatemeh Golshan
Sharif University of Technology, Tehran, Iran

Pejman Lotfi-Kamran
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Hamid Sarbazi-Azad
Sharif University of Technology, and Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Mehran Shakerinava
Sharif University of Technology, Tehran, Iran

Preface

Advances in Computers, the oldest series to chronicle the rapid evolution of computing, annually publishes several volumes, each typically comprising four to eight chapters, describing new findings and developments in the theory and applications of computing.

This 125th volume is a thematic one entitled “Data Prefetching Techniques in Computer Systems,” inspired by recent work on data prefetching techniques proposed for advanced processors and used in computer systems. This volume comprises six chapters.

While processor performance has been rapidly improving during the past decades, the main goal of designers was to increase the capacity of the main memory rather than its speed. Therefore, there emerged a considerable gap, called the memory wall, between the speed of the processor and main memory. To tackle the memory wall problem, a high-speed, low-capacity buffer called cache memory is used between the processor and the main memory to remedy their performance gap. When a processor requires a piece of data, if it is in the cache (called a hit), it can be delivered to the processor very quickly. Otherwise (called a miss), the data has to be accessed in the main memory. In such cases, cache memory is not useful and even increases the access latency, as we first need to search the cache and then access the main memory. As the main memory latency is considerable for a processor, lowering the number of requests served by the memory is desirable. Data prefetching is a way to improve the effectiveness of caches and reduce the performance gap between the processor and the main memory by lowering the stall cycles seen by the processor when serving main memory requests. A data prefetcher predicts that a piece of data is going to be used by the processor. In case the predicted data is not in the cache, the data prefetcher brings it into the cache in advance.

The main task of a data prefetcher is to predict future memory accesses. Fortunately, the memory accesses of a processor are usually regular and follow patterns. By learning access patterns, a data prefetcher can make correct predictions. Based on the different access patterns that a data prefetcher can learn, different types of data prefetchers may be designed and used. This book introduces several important types of data prefetchers and discusses how they perform data prefetching, why they work as intended, when they are effective, and when they are not.

In this book, we first discuss the basic definitions and fundamental concepts of data prefetching and then study classic hardware data prefetchers, as well as recent advanced data prefetchers. We describe the operations of every data prefetcher in detail and shed light on its design trade-offs.

The six chapters of the book are organized as follows. In Chapter 1, an introduction to data prefetching is given. Different prefetchers are defined, and many recent prefetchers are categorized as well. Chapter 2 discusses spatial prefetching techniques and introduces the SMS and VLDP techniques in detail. Chapter 3 discusses temporal data prefetching and explains STMS and ISB. Techniques that cannot be classified as purely spatial or purely temporal data prefetchers are considered in Chapter 4. Chapter 5 introduces some recently proposed data prefetchers. Finally, Chapter 6 evaluates the effectiveness of different prefetchers under various working conditions.

We hope that readers will find this volume interesting and useful for teaching, research, and designing data prefetchers. We welcome any feedback and suggestions related to the book.

PEJMAN LOTFI-KAMRAN
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

HAMID SARBAZI-AZAD
Sharif University of Technology, and Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Table of abbreviations

AMT     access map table
BOP     best-offset prefetcher
CPU     central processing unit
DHB     delta history buffer
DPC     data prefetching championship
DPT     delta prediction table
DRAM    dynamic random access memory
DSS     decision support system
EIT     enhanced index table
GPU     graphics processing unit
IPC     instructions per cycle
ISB     irregular stream buffer
L1D     level 1 data
LLC     last-level cache
LRU     least recently used
MLOP    multi-lookahead offset prefetcher
NVM     non-volatile memory
OLTP    online transaction processing
OoO     out of order
OPT     offset prediction table
OS      operating system
PAS     physical address space
PC      program counter
PHT     pattern history table
PSAM    physical-to-structural address mapping
PST     patterns sequence table
RMD     runahead metadata
RMOB    region miss order buffer
ROB     reorder buffer
RRT     recent requests table
SAS     structural address space
SMS     spatial memory streaming
SP      sandbox prefetcher
SPAM    structural-to-physical address mapping
SPU     sandbox prefetch unit
STeMS   spatio-temporal memory streaming
STMS    sampled temporal memory streaming
TLB     translation lookaside buffer
TPC     transaction processing performance council
VLDP    variable length delta prefetcher

Introduction to data prefetching

For many years, computer designers benefited from Moore’s law to significantly improve the speed of processors. For main memory, unlike processors, the focus of improvement has been capacity rather than access time. Hence, there is a large gap between the speed of processors and main memory. To hide the gap, a hierarchy of caches has been used between a processor and main memory. While caches have proved to be quite effective, their effectiveness directly depends on how many times a requested piece of data can be found in the cache. Due to the complexity of data access patterns, on many occasions, the requested piece of data cannot be found in the cache hierarchy, which exposes large delays to processors and significantly degrades their performance. Data prefetching is the science and art of predicting, in advance, what pieces of data a processor needs and bringing them into the cache before the processor requests them. In this chapter, we justify the importance of data prefetching, look at a taxonomy of data prefetching, and briefly discuss some simple ways to do prefetching.

1. Introduction

We are living in an era in which Information Technology (IT) is shaping our society. More than at any time in history, our society depends on IT for its day-to-day activities. Education, media, science, social networking, etc. are all affected by IT [1]. The steady growth in processor performance is one of the driving forces behind the success and widespread adoption of IT.

Historically, processor performance improved by a factor close to two every 2 years (this phenomenon is sometimes mistakenly referred to as Moore’s law [2]). While processor performance was rapidly improving, the goal of designers was to rapidly increase the capacity of DRAM, which is the building block of main memory. As the speed of DRAM was a second-level optimization for designers, over time, there emerged a considerable gap between the speed of processors and main memory, to which many refer as the memory wall.

Since 2004, the rate at which processor performance improves has slowed [3]. Due to this reduction in the annual improvement in processor performance, the memory wall is no longer widening as it did in the last two decades of the 20th century. Nevertheless, the memory wall is considerable and hurts processor performance.

To hide the delay of the main memory, historically, a cache is used between the processor and the main memory. A cache is a small and fast hardware-managed storage built from SRAM, which is much faster than DRAM but requires more space per bit. As the cache is very small, only part of the data can be stored in it at any given point in time. When a processor asks for a piece of data, if it is in the cache (called a hit), it can be delivered to the processor very quickly. However, if the data is not in the cache (called a miss), we have to get the data elsewhere (e.g., main memory). In such cases, the cache is not only of no use but also increases the access latency, as we first need to search the cache and then look for the piece of data elsewhere.

To increase the likelihood of finding the data in the cache, contemporary processors benefit from a hierarchy of caches. These processors use multiple levels of caches. The cache that is closest to the processor is the smallest and the fastest. We usually refer to this cache as the L1 cache. As we move further away from the processor, the caches become larger and slower. We refer to these caches as L2, L3, etc. The cache at the end of the cache hierarchy, which is closest to the main memory, is usually referred to as the last-level cache (LLC).

The cache hierarchy is quite effective and serves many of the processor’s memory requests. Nonetheless, the cache hierarchy cannot handle all of the memory requests. For every request that goes to the main memory, the processor is exposed to the whole delay of accessing memory. Given that the delay of main memory is considerable for a processor, a mechanism that lowers the number of requests served by the memory is quite useful.

Data prefetching is a way to improve the effectiveness of caches and lower the stall cycles caused by serving processor requests from the main memory. A data prefetcher predicts that a piece of data is going to be used by the processor. Based on this prediction, in case the predicted data is not in the cache, the data prefetcher brings it into the cache. If the prediction is correct, a cache miss is avoided, and the processor finds the data in the cache when it is needed. Otherwise, a useless piece of data is brought into the cache, and possibly a useful piece of data is evicted from the cache to make room for the predicted data.

The central task of a data prefetcher is predicting the future memory accesses of a processor. Fortunately, the memory accesses of a processor are usually regular and follow patterns (as we see in this book). By learning a pattern and using it for prediction, a data prefetcher can make correct predictions.

There are many patterns that a data prefetcher may learn, and hence, there are many different types of data prefetchers. In this book, we introduce several important types of data prefetchers. We discuss how they perform data prefetching and why they work as intended. We also mention when they are effective and when they are not. We first discuss the fundamental concepts in data prefetching and then study recent, as well as classic, hardware data prefetchers. We describe the operations of every data prefetcher in detail and shed light on its design trade-offs.

2. Background

In this section, we briefly review some background on hardware data prefetching.

2.1 Predicting memory references

The first step in data prefetching is predicting future memory accesses. Fortunately, data accesses demonstrate several types of correlations and localities, which lead to the formation of patterns among memory accesses, from which data prefetchers can predict future memory references. These patterns emerge from the layout of programs’ data structures in the memory, and the algorithm and the high-level programming constructs that operate on these data structures. In this chapter, we briefly mention three important memory access patterns of applications: (1) stride, (2) temporal, and (3) spatial access patterns.

2.1.1 Stride accesses

A stride access pattern refers to a sequence of memory accesses in which the distance between consecutive accesses is constant, e.g., {A, A+k, A+2k, …} with stride k. Such patterns are frequent in programs with dense matrices and frequently appear when programs operate on multi-dimensional arrays. Please consider the following piece of code.

Pseudocode 1. A simple program that calculates the sum of an array of bytes.

    byte A[100];
    // some computation with A
    ...
    SUM = 0;
    for (int i = 0; i < 100; i++)
        SUM += A[i];

In this example, all elements of array A are read and added to the variable SUM, which holds the aggregate sum of array A. As all elements of an array are placed one after another in memory, if A[0] is placed at address addr, A[1] is placed at addr+1. Similarly, the rest of the elements are placed at addr+2, addr+3, …, addr+99. Due to this particular placement of array A in memory and the nature of the algorithm, Pseudocode 1 generates memory references addr, addr+1, …, addr+99 during execution, and hence exhibits a simple stride access pattern with a stride of 1. If we replace the “for (int i = 0; i < 100; i++)” statement in the code with “for (int i = 0; i < 100; i += 2),” the code references addresses addr, addr+2, addr+4, …, addr+98, and hence exhibits an access pattern with a stride of 2.

While stride accesses are abundant in array and matrix data structures, they are not unique to such data structures. Stride accesses also appear in pointer-based data structures when memory allocators arrange the objects sequentially and in a constant-size manner in the memory [1].
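The two loops discussed above can be mimicked with a short Python sketch that lists the addresses each loop would touch and checks whether the distance between consecutive accesses is constant. The base address 0x1000 and the helper names access_addresses and detect_stride are assumptions made for illustration; they do not appear in the original text.

```python
def access_addresses(base, count, step=1):
    """Addresses touched by a loop that reads a byte array starting at `base`."""
    return [base + i for i in range(0, count, step)]


def detect_stride(addresses):
    """Return the constant distance between consecutive accesses, or None."""
    deltas = {b - a for a, b in zip(addresses, addresses[1:])}
    return deltas.pop() if len(deltas) == 1 else None


base = 0x1000  # hypothetical address of A[0]
unit = access_addresses(base, 100)            # for (i = 0; i < 100; i++)
every_other = access_addresses(base, 100, 2)  # for (i = 0; i < 100; i += 2)

print(detect_stride(unit))         # stride of the first loop
print(detect_stride(every_other))  # stride of the second loop
```

A sequence whose deltas are not all equal, such as [0, 1, 3], yields None, which is the case a stride prefetcher cannot exploit.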

2.1.2 Temporal address correlation

Temporal address correlation [2] refers to a sequence of addresses that tend to be accessed together and in the same order. For example, if we observe {A, B, C, D}, then it is likely for {B, C, D} to follow {A} in the future. Temporal address correlation stems fundamentally from the fact that programs consist of loops, and is observed when data structures such as lists, arrays, and linked lists are traversed. When data structures are stable [4], access patterns recur, and the temporal address correlation is manifested [2]. Please consider the following piece of code.

Pseudocode 2. A simple program that calculates the sum of a linked list of bytes.

    struct ELEMENT_T { byte B; pointer P; };

    ELEMENT_T *e;
    // create and manipulate a linked list of ELEMENT_T that e points to
    ...
    SUM = 0;
    ELEMENT_T *p = e;
    while (p != NULL) {
        SUM += p->B;
        p = p->P;
    }

The functionality of Pseudocode 2 is similar to that of Pseudocode 1. The major difference is that the array in Pseudocode 1 is replaced with a linked list. As every element of a linked list can be anywhere in the memory, there is usually no manifestation of a stride access pattern when the code is being executed. However, if we traverse the linked list multiple times, e.g., the code makes some modifications to the linked list and then attempts to determine the total sum again, the sequence of memory accesses due to the linked-list traversal will be the same as the previous time, except for differences due to the addition or deletion of certain elements in the linked list. If the change in the linked list is not significant, most of the sequence of memory references is identical across multiple traversals of the linked list, which leads to the manifestation of a temporal access pattern.
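To make the repeated-traversal argument concrete, the following Python sketch builds a linked list at scattered addresses, records which address followed which during one traversal, removes one node, and then counts how many next-address predictions still hold on the second traversal. The addresses, the traverse helper, and the successor table are illustrative assumptions, not a description of any specific prefetcher covered in the book.

```python
import random


def traverse(head, next_of):
    """Return the sequence of node addresses visited starting from `head`."""
    seq, node = [], head
    while node is not None:
        seq.append(node)
        node = next_of[node]
    return seq


random.seed(0)
# Ten nodes at scattered (hypothetical) addresses, so no stride is expected.
nodes = random.sample(range(0x1000, 0x9000, 64), 10)
next_of = {a: b for a, b in zip(nodes, nodes[1:])}
next_of[nodes[-1]] = None

first = traverse(nodes[0], next_of)

# Record, for each address, the address that followed it in the first pass.
successor = {a: b for a, b in zip(first, first[1:])}

# A small modification: unlink the fifth node before the second traversal.
next_of[nodes[3]] = nodes[5]
second = traverse(nodes[0], next_of)

# Replay the recorded successors as predictions for the second traversal.
correct = sum(successor.get(a) == b for a, b in zip(second, second[1:]))
print(f"{correct} of {len(second) - 1} next-address predictions were correct")
```

Only the transition across the removed node is mispredicted; every other recorded successor is still valid, which is the recurrence that temporal prefetching relies on.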

2.1.3 Spatial address correlation

Spatial address correlation [3] refers to the phenomenon that similar access patterns occur in different regions of memory. For example, if a program visits locations {A, B, C, D} of page X, it is probable that it visits locations {A, B, C, D} of other pages as well. Spatial correlation transpires because applications use various objects with a regular and fixed layout, and accesses reappear while traversing data structures [3]. To better understand the concept, please consider the following piece of code.

Pseudocode 3. A simple program that calculates the sum of the three scalar fields of every element of a linked list.

    struct ELEMENT_T { byte a; byte A1[100]; byte b; byte A2[20]; byte c; pointer p; };

    ELEMENT_T *e;
    // create and manipulate a linked list of ELEMENT_T that e points to
    ...
    SUM = 0;
    ELEMENT_T *p = e;
    while (p != NULL) {
        SUM += p->c;
        SUM += p->b;
        SUM += p->a;
        p = p->p;
    }

The code is similar to Pseudocode 2. The major difference is that each element of the linked list is a large structure consisting of two arrays and three scalar variables in addition to the pointer. The piece of code, however, only cares about the three scalar variables and attempts to calculate the sum of all the scalar variables in the linked list. As the data structure is a linked list, and hence its elements may be at different locations in the main memory, the code likely does not exhibit stride access patterns. Moreover, unlike the last example, assume that we plan to traverse the linked list only once. As such, the code does not exhibit temporal access patterns either. Nonetheless, while each element of the linked list might be at any location in the main memory, once the code touches variable c, it also touches variables b and a. Moreover, as these variables are located in the main memory with a fixed layout dictated by the ELEMENT_T structure, when the first variable is touched, the locations of the other two can be inferred. This access pattern is called a spatial access pattern.
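The inference described above can be sketched in Python as learning a footprint of offsets from one element and applying it to another. The sketch assumes 1-byte fields with no padding, which gives offsets 0, 101, and 122 for a, b, and c; the element addresses and the helper names are hypothetical. Real spatial prefetchers typically track footprints at cache-block rather than byte granularity, so this is a simplification.

```python
# Field offsets within ELEMENT_T, assuming 1-byte fields and no padding.
OFFSET = {"a": 0, "b": 101, "c": 122}


def touched(element_base):
    """Addresses read while summing c, b, and a of one element."""
    return [element_base + OFFSET[f] for f in ("c", "b", "a")]


def learn_footprint(trigger, accesses):
    """Record accesses as offsets relative to the first (trigger) access."""
    return [addr - trigger for addr in accesses]


def predict(trigger, footprint):
    """Apply a learned footprint to a new trigger address."""
    return [trigger + delta for delta in footprint]


first_elem, other_elem = 0x4000, 0x9A40  # hypothetical, unrelated addresses

seen = touched(first_elem)
footprint = learn_footprint(seen[0], seen)

# On the first touch of another element (its field c), infer the rest.
trigger = other_elem + OFFSET["c"]
print(predict(trigger, footprint) == touched(other_elem))
```

The footprint is expressed relative to the trigger access, so it carries over to any element regardless of where the allocator placed it.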

2.2 Prefetching lookahead

Prefetchers need to issue timely prefetch requests for the predicted addresses. Preferably, a prefetcher sends prefetch requests well in advance and supplies enough storage for the prefetched blocks to hide the entire latency of memory accesses. A prefetch request that is too early may evict a useful block from the cache, and a late prefetch may decrease the effectiveness of prefetching in that a portion of the long latency of the memory access is exposed to the processor.

Prefetching lookahead refers to how far ahead of the demand miss stream the prefetcher can send requests. An aggressive prefetcher may offer a high prefetching lookahead (say, 8) and issue many prefetch requests ahead of the processor’s demand requests to hide the entire latency of memory accesses; on the other hand, a conservative prefetcher may offer a low prefetching lookahead and send a single prefetch request in advance of the processor’s demand to avoid wasting resources (e.g., cache storage and memory bandwidth). Typically, there is a trade-off between the aggressiveness of a prefetching technique and its accuracy: making a prefetcher more aggressive usually leads to covering more data-miss-induced stall cycles but at the cost of fetching more useless data.

Some pieces of prior work propose to dynamically adjust the prefetching lookahead [5–7]. Based on the observation that the optimal prefetching degree differs across applications and across the execution phases of a particular application, these approaches employ heuristics to increase or decrease the prefetching lookahead. For example, SPP [6] monitors the accuracy of issued prefetch requests and reduces the prefetching lookahead if the accuracy becomes smaller than a predefined threshold.
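A generic accuracy-driven heuristic of the kind described above might look like the following Python sketch. It is not the mechanism of SPP or of any other cited design: the thresholds (0.4 and 0.75), the step size of one, the maximum lookahead of 8, and the rule for raising the lookahead are all assumptions chosen for illustration.

```python
def adjust_lookahead(lookahead, useful, issued,
                     low=0.4, high=0.75, max_lookahead=8):
    """Return the lookahead for the next interval given measured accuracy."""
    if issued == 0:
        return lookahead  # nothing was measured in this interval
    accuracy = useful / issued
    if accuracy < low:
        return max(1, lookahead - 1)              # too many useless prefetches
    if accuracy > high:
        return min(max_lookahead, lookahead + 1)  # room to be more aggressive
    return lookahead


lookahead = 4
# (useful prefetches, issued prefetches) per interval; made-up numbers.
for useful, issued in [(9, 10), (8, 10), (2, 10), (1, 10), (5, 10)]:
    lookahead = adjust_lookahead(lookahead, useful, issued)
    print(lookahead)
```

Keeping a band between the two thresholds in which the lookahead is left unchanged avoids oscillating on every interval when the accuracy hovers around a single cut-off.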

2.3 Location of data prefetcher

Prefetching can be employed to move the data from lower levels of the memory hierarchy to any higher level. Prior work used data prefetchers at all cache levels, from the primary data cache to the shared last-level cache.

The location of a data prefetcher has a profound impact on its overall behavior [8]. A prefetcher in the first-level cache can observe all memory accesses, and hence, can issue highly accurate prefetch requests, but at the cost of imposing a large storage overhead for recording the metadata information. In contrast, a prefetcher in the last-level cache observes access sequences that have been filtered at higher levels of the memory hierarchy, resulting in lower prediction accuracy but higher storage efficiency.

2.4 Prefetching hazards

A naive deployment of a data prefetcher may not only fail to improve system performance but may also significantly harm performance and energy efficiency [9]. The two well-known major drawbacks of data prefetching are (1) cache pollution and (2) off-chip bandwidth overhead.

2.4.1 Cache pollution

Data prefetching may increase the number of demand misses by replacing useful cache blocks with useless prefetched data, harming performance. Cache pollution usually occurs when an aggressive prefetcher exhibits low accuracy and/or when prefetch requests of a core in a many-core processor compete for shared resources with demand accesses of other cores [10].
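The effect can be reproduced with a toy simulation. The Python sketch below models a four-block, fully associative LRU cache and a working set of four blocks that is traversed ten times; in the second run, a deliberately inaccurate prefetcher inserts one never-used block after every demand access. The cache size, the working set, and the junk prefetch stream are assumptions chosen to make the effect visible, not measurements of a real system.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of block addresses with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def _insert(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark as most recently used
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[block] = True

    def access(self, block):
        """Demand access; return True on a hit."""
        hit = block in self.blocks
        self._insert(block)
        return hit

    def prefetch(self, block):
        """Bring a predicted block into the cache."""
        self._insert(block)


def demand_misses(useless_prefetches_per_access):
    cache = LRUCache(capacity=4)
    misses, junk = 0, 1000
    for _pass in range(10):              # ten passes over the working set
        for block in (0, 1, 2, 3):
            if not cache.access(block):
                misses += 1
            for _ in range(useless_prefetches_per_access):
                cache.prefetch(junk)     # a block that is never demanded
                junk += 1
    return misses


print(demand_misses(0), demand_misses(1))
```

Without prefetching, only the four cold misses of the first pass occur. With one useless prefetch per access, every useful block is evicted before it is reused, so all 40 demand accesses miss.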

2.4.2 Bandwidth overhead

In a many-core processor, prefetch requests of a core can delay demand requests of another core because they contend for memory bandwidth [10]. This interference is the major obstacle to using data prefetchers in many-core processors, and the problem gets thornier as the number of cores increases [11,12].

2.4.3 Placing prefetched data

Data prefetchers usually place the prefetched data into one of the following two structures: (1) the cache itself, and (2) an auxiliary buffer next to the cache. In case an auxiliary buffer is used for the prefetched data, demand requests first look for the data in the cache; if the data is not found, the auxiliary buffer is searched before sending a request to the lower levels of the memory hierarchy.

Storing the prefetched data in the cache lowers the latency of accessing the data when the prediction is correct. However, when the prediction is incorrect or when the prefetch request is not timely (i.e., too early), having the prefetched data in the cache may result in evicting useful cache blocks.
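The lookup order for the auxiliary-buffer organization can be sketched as follows. The Python sets standing in for the cache and the prefetch buffer, the lookup function, and the promotion of a block into the cache on a buffer hit are illustrative assumptions; the text above specifies only the order in which the structures are searched.

```python
def lookup(addr, cache, prefetch_buffer):
    """Return where a demand request for `addr` is served from."""
    if addr in cache:
        return "cache"
    if addr in prefetch_buffer:
        # Assumption: a buffer hit promotes the block into the cache.
        prefetch_buffer.remove(addr)
        cache.add(addr)
        return "prefetch buffer"
    cache.add(addr)  # filled from the lower level (no eviction modelled)
    return "lower level"


cache, prefetch_buffer = {0x100}, {0x140}
for addr in (0x100, 0x140, 0x180, 0x140):
    print(hex(addr), lookup(addr, cache, prefetch_buffer))
```

Because prefetched blocks wait in the buffer until they are demanded, a wrong prediction occupies a buffer entry rather than displacing a useful cache block.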

2.5 Prefetcher types

There are several types of data prefetchers. At a very high level, data prefetchers can be classified into hardware prefetchers and nonhardware prefetchers. A hardware prefetcher is a data prefetching technique that is implemented as a hardware component in a processor. Any other prefetching technique is a nonhardware prefetcher. Fig. 1 shows a classification of data prefetching techniques.

We focus on hardware data prefetching techniques in this book. As shown in Fig. 1, hardware data prefetchers can be classified into spatial, temporal, and non-spatial-temporal prefetchers. We cover conventional spatial prefetchers in chapter “Spatial prefetching” by Lotfi-Kamran and Sarbazi-Azad and state-of-the-art spatial prefetchers in chapter “State-of-the-art data prefetchers” by Shakerinava et al. Conventional and state-of-the-art temporal prefetchers are covered in chapters “Temporal prefetching” by Lotfi-Kamran and Sarbazi-Azad and “State-of-the-art data prefetchers” by Shakerinava et al., respectively. Non-spatial-temporal prefetchers are covered in chapter “Beyond spatial or temporal prefetching” by Lotfi-Kamran and Sarbazi-Azad. We evaluate various types of hardware data prefetchers in chapter “Evaluation of data prefetchers” by Shakerinava et al. to empirically assess their strengths and weaknesses.

Fig. 1 A classification of various data prefetching techniques (with an emphasis on hardware data prefetchers).

Fig. 1 also shows a classification of the nonhardware data prefetchers. As this book is about hardware data prefetching, we do not cover nonhardware data prefetching techniques in depth. However, Section 4 of this chapter reviews various nonhardware data prefetching techniques and offers a short explanation of them.

3. A preliminary hardware data prefetcher

To give insight into how a typical prefetcher operates, we now describe a preliminary yet prevalent type of stride prefetching. Generally, stride prefetchers are widely used in commercial processors (e.g., IBM Power4 [13], Intel Core [14], AMD Opteron [15], Sun UltraSPARC III [16]) and have been shown to be quite effective for desktop and engineering applications. Stride prefetchers [4,17–23] detect streams (i.e., sequences of consecutive addresses) that exhibit stride access patterns and generate prefetch requests by adding the detected stride to the last observed address.

Instruction-Based Stride Prefetcher (IBSP) [17] is a preliminary type of stride prefetching. The prefetcher tracks the stride streams on a per-load-instruction basis: the prefetcher observes accesses issued by individual load instructions and sends prefetch requests if the accesses manifest a stride pattern. Fig. 2 shows the organization of IBSP’s metadata table, named the Reference Prediction Table (RPT). The RPT is a structure tagged and indexed with the program counter (PC) of load instructions. Each entry in the RPT corresponds to a specific load instruction; it keeps the Last Block referenced by the instruction and the Last Stride observed in the stream (i.e., the distance between the two last addresses accessed by the instruction).

Fig. 2 The organization of Instruction-Based Stride Prefetcher (IBSP). The ‘RPT’ keeps track of various streams.

Upon each trigger access (i.e., a cache miss or a prefetch hit), the RPT is searched with the PC of the instruction. If the search results in a miss, it means that no history exists for the instruction, and hence, no prefetch request can be issued. A search may result in a miss under two circumstances: (1) when a load instruction is new in the execution flow of the program, and therefore, no history has been recorded for it so far, and (2) when a load instruction is re-executed after a long time, and the corresponding recorded metadata has been evicted from the RPT due to conflicts. In such cases, when no matching entry exists in the RPT, a new entry is allocated for the instruction, and possibly a victim entry is evicted. The new entry is tagged with the PC, and the Last Block field of the entry is filled with the referenced address. The Last Stride is also set to zero (an invalid value), as no stride has yet been observed for this stream.

However, if searching the RPT results in a hit, it means that there is a recorded history for the instruction. In this case, the recorded history is checked against the current access to find out whether or not the stream is a stride stream. To do so, the difference between the current address and the Last Block is calculated to get the current stride. Then, the current stride is checked against the recorded Last Stride. If they do not match, it is implied that the stream does not exhibit a stride access pattern. However, if they match, it is construed that the stream is a stride stream, as three consecutive accesses have produced two identical strides. In this case, based on the lookahead of the prefetcher (Section 2.2), several prefetch requests can be issued by consecutively adding the observed stride to the requested address. For example, if the current address and stride are A and k, respectively, and the lookahead of prefetching is three, the prefetch candidates will be {A+k, A+k+k, A+k+k+k}. Finally, regardless of whether the stream is a stride stream or not, the corresponding RPT entry is updated: the Last Block is updated with the current address, and the Last Stride takes the value of the current stride.
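The lookup, compare, and update steps described above can be sketched in Python as follows. This is a simplified model under stated assumptions: the RPT is an unbounded dictionary rather than a finite tagged table (so conflict evictions never occur), every call to on_trigger is assumed to be a trigger access, and the class and method names are hypothetical.

```python
class StridePrefetcher:
    """Instruction-based stride prefetcher with an unbounded RPT (a dict)."""

    def __init__(self, lookahead=3):
        self.lookahead = lookahead
        self.rpt = {}  # PC -> (last_block, last_stride)

    def on_trigger(self, pc, addr):
        """Process a trigger access; return the list of prefetch candidates."""
        entry = self.rpt.get(pc)
        if entry is None:
            # RPT miss: allocate an entry; stride 0 marks "no stride yet".
            self.rpt[pc] = (addr, 0)
            return []
        last_block, last_stride = entry
        stride = addr - last_block
        candidates = []
        if stride != 0 and stride == last_stride:
            # Two identical strides in a row: treat the stream as a stride one.
            candidates = [addr + stride * i
                          for i in range(1, self.lookahead + 1)]
        # Update the entry whether or not the strides matched.
        self.rpt[pc] = (addr, stride)
        return candidates


pf = StridePrefetcher(lookahead=3)
pc = 0x400  # hypothetical PC of a load instruction
for addr in (100, 108, 116, 124):
    print(addr, pf.on_trigger(pc, addr))
```

With a stride of 8, the first two accesses only train the entry; the third access confirms the stride and produces the candidates {A+k, A+2k, A+3k} for the current address A.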

4. Nonhardware data prefetching

Progress in fabrication technology, accompanied by circuit-level and microarchitectural advancements, has brought about significant enhancements in processor performance over the past decades. Meanwhile, the performance of memory systems has not improved in pace with that of processors, forming a large gap between the performance of processors and memory systems [13–17]. As a consequence, numerous approaches have been proposed to enhance the execution performance of applications by bridging the processor-memory performance gap. Hardware data prefetching is just one of these approaches. Hardware data prefetching bridges the gap by proactively fetching the data ahead of the cores’ requests to eliminate the idle cycles in which the processor is waiting for the response of the memory system. In this section, we briefly review the other approaches that target the same goal (i.e., bridging the processor-memory performance gap) but in other ways.

Multithreading [18] enables the processor to better utilize its computational resources, as stalls in one thread can be overlapped with the execution of other threads [4,19,20]. Multithreading, however, only improves throughput and does nothing for (or even worsens) the response time [15,21–23], which is crucial for satisfying the strict latency requirements of server applications.

Thread-Based Prefetching techniques [24–28] exploit idle thread contexts or distinct pre-execution hardware to drive helper threads that try to overlap the cache misses with speculative execution. Such helper threads, formed either by the hardware or by the compiler, execute a piece of code that prefetches for the main thread. Nonetheless, the additional threads and fetch/execution bandwidth may not be available when the processor is fully utilized. The abundant request-level parallelism of server applications [22–24] makes such schemes ineffective in that the helper threads need to compete with the main threads for the hardware contexts.

RunaheadExecution [29–31] makestheexecutionresourcesofacore that wouldotherwisebestalledonanoff-chipcachemisstogoaheadof thestalledexecutioninanattempttodiscoveradditionalloadmisses. Similarly,BranchPredictionDirectedPrefetching [5] utilizesthebranch predictor toruninadvanceoftheexecutingprogram,therebyprefetching loadinstructionsalongtheexpectedfuturepath.Suchapproaches,nevertheless,areconstrainedbytheaccuracyofthebranchpredictorandcan coversimplyaportionofthemisslatency,sincetherunaheadthread/branch predictormaynotbecapableofexecutingfaraheadinadvanceto completelyhideacachemiss.Moreover,theseapproachescanonlyprefetch independentcachemisses [32] andmaynotbeeffectiveformanyofthe server workloads,e.g.,OLTPandWebapplications,thatarecharacterized bylongchainsofdependentmemoryaccesses [2,33]

On the software side, there are efforts to restructure programs to boost chip-level Data Sharing and Data Reuse [34,35] in order to decrease off-chip accesses. While these techniques are useful for workloads with modest datasets, they fall short of efficiency for big-data server workloads, where the multigigabyte working sets of workloads dwarf the few megabytes of on-chip cache capacity. The ever-growing datasets of server workloads make such approaches unscalable. Software Prefetching techniques [36–41] profile the program code and insert prefetch instructions to eliminate cache misses. While these techniques are shown to be beneficial for small benchmarks, they usually require significant programmer effort to produce optimized code to generate timely prefetch requests.
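Part of the effort behind "timely" software prefetches lies in choosing how far ahead to prefetch. A common rule of thumb for a loop is to prefetch at least `ceil(miss_latency / cycles_per_iteration)` iterations ahead. The sketch below applies that rule in a deliberately pessimistic toy model; the function names and numbers are hypothetical, and the model ignores bandwidth limits, the extra slack that stalls themselves create, and the risk that data prefetched too early is evicted before use.

```python
import math


def min_prefetch_distance(miss_latency, cycles_per_iteration):
    """Smallest number of iterations ahead that hides the whole miss."""
    if miss_latency < 0 or cycles_per_iteration <= 0:
        raise ValueError("miss_latency >= 0 and cycles_per_iteration > 0 required")
    return math.ceil(miss_latency / cycles_per_iteration)


def residual_stall(miss_latency, cycles_per_iteration, distance):
    """Cycles of a miss still exposed when prefetching `distance` iterations ahead.

    The prefetch gets distance * cycles_per_iteration cycles of head start;
    whatever part of the latency is left over is paid as a stall.
    """
    return max(0, miss_latency - distance * cycles_per_iteration)


if __name__ == "__main__":
    latency, per_iter = 200, 30
    d = min_prefetch_distance(latency, per_iter)
    print("distance:", d)
    print("stall at distance 2:", residual_stall(latency, per_iter, 2))
    print("stall at distance d:", residual_stall(latency, per_iter, d))
```

Both inputs depend on the target machine and on the loop body, which is one reason prefetch insertion tends to need profiling and per-platform tuning.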

Memory-Side Prefetching techniques [42–44] place the hardware for data prefetching near DRAM, for the sake of saving precious SRAM budget. In such approaches (e.g., [43]), prefetching is performed by a user thread running near the DRAM, and prefetched pieces of data are sent to the on-chip caches. Unfortunately, such techniques lose the predictability of core requests [45] and are incapable of performing cache-level optimizations (e.g., avoiding cache pollution [7]).

5. Conclusion

Data cache misses are a major source of performance degradation in applications. Data prefetching is a widely-used technique for reducing the number of data cache misses or their negative effects. Data prefetchers usually benefit from correlations and localities among data accesses to predict future memory references. As there exist several types of correlations among data accesses, there are several types of data prefetchers. In this book, we introduce several important classes of data prefetchers and highlight their strengths and weaknesses.

References

[1] K.R. Lee, Impacts of Information Technology on Society in the New Century, 2001. https://www.zurich.ibm.com/pdf/news/Konsbruck.pdf.

[2] G.E. Moore, Cramming more components onto integrated circuits, Electronics 38 (8) (1965) 114–117.

[3] D. Geer, Chip makers turn to multicore processors, Computer 38 (5) (2005) 11–13.

[4] J.-L. Baer, T.-F. Chen, An effective on-chip preloading scheme to reduce data access penalty, in: Proceedings of the ACM/IEEE Conference on Supercomputing, 1991.

[5] F. Dahlgren, P. Stenstrom, Hardware-based stride and sequential prefetching in shared-memory multiprocessors, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 1995.

[6] P. Lotfi-Kamran, H. Sarbazi-Azad, M. Bakhshalipour, Domino temporal data prefetcher, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2018.

[7] M. Shakerinava, P. Lotfi-Kamran, H. Sarbazi-Azad, M. Bakhshalipour, Bingo spatial data prefetcher, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2019.

[8] J. Kim, P. Sharma, R. Panda, P. Gratz, D. Jimenez, D. Kadjo, B-fetch: branch prediction directed prefetching for chip-multiprocessors, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2014.

[9] Z. Fang, A. Zhai, P. Yew, S. Mehta, Multi-stage coordinated prefetching for present-day processors, in: Proceedings of the International Conference on Supercomputing (ICS), 2014.

[10] O. Mutlu, H. Kim, Y.N. Patt, S. Srinath, Feedback directed prefetching: improving the performance and bandwidth-efficiency of hardware prefetchers, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2007.

[11] S.H. Pugsley, P.V. Gratz, A.L. Narasimha Reddy, C. Wilkerson, Z. Chishti, J. Kim, Path confidence based lookahead prefetching, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2016.

[12] P. Lotfi-Kamran, A. Mazloumi, F. Samandi, M. Naderan-Tahan, M. Modarressi, S.-A. Hamid, M. Bakhshalipour, Fast data delivery for many-core processors, IEEE Trans. Comput. 67 (10) (2018) 1416–1429.

[13] O. Mutlu, C.J. Lee, Y.N. Patt, E. Ebrahimi, Coordinated control of multiple prefetchers in multi-core systems, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2009.

[14] C.J. Lee, O. Mutlu, Y.N. Patt, E. Ebrahimi, Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems, in: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010.

[15] O. Dongsu Han, M.H.-B. Mutlu, Y. Kim, ATLAS: a scalable and high-performance scheduling algorithm for multiple memory controllers, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2010.

[16] J. Steve Dodson, J. Fields, H.Q. Le, B. Sinharoy, J.M. Tendler, POWER4 system microarchitecture, IBM J. Res. Develop. 46 (1) (2002) 5–25.