ADVANCES IN COMPUTERS
Data Prefetching Techniques in Computer Systems
Edited by
PEJMAN LOTFI-KAMRAN
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
HAMID SARBAZI-AZAD
Sharif University of Technology, and Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Academic Press is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
525 B Street, Suite 1650, San Diego, CA 92101, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
125 London Wall, London, EC2Y 5AS, United Kingdom
First edition 2022
Copyright © 2022 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-323-85119-0
ISSN: 0065-2458
For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Zoe Kruze
Developmental Editor: Cindy Angelita Gardose
Production Project Manager: James Selvam
Cover Designer: Greg Harris
Typeset by STRAIVE, India
Contributors
Preface
Table of abbreviations
1. Introduction to data prefetching
Pejman Lotfi-Kamran and Hamid Sarbazi-Azad
  1. Introduction
  2. Background
  3. A preliminary hardware data prefetcher
  4. Nonhardware data prefetching
  5. Conclusion
  References
  Further reading
  About the authors
2. Spatial prefetching
Pejman Lotfi-Kamran and Hamid Sarbazi-Azad
  1. Introduction
  2. Spatial memory streaming (SMS)
  3. Variable length delta prefetcher (VLDP)
  4. Summary
  References
  About the authors
3. Temporal prefetching
Pejman Lotfi-Kamran and Hamid Sarbazi-Azad
  1. Introduction
  2. Sampled temporal memory streaming (STMS)
  3. Irregular stream buffer (ISB)
  4. Summary
  References
  About the authors
4. Beyond spatial or temporal prefetching
Pejman Lotfi-Kamran and Hamid Sarbazi-Azad
  1. Introduction
  2. Spatio-temporal memory streaming (STeMS)
  3. Summary
  References
  About the authors
5. State-of-the-art data prefetchers
Mehran Shakerinava, Fatemeh Golshan, Ali Ansari, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad
  1. DOMINO temporal data prefetcher
  2. BINGO spatial data prefetcher
  3. Multi-lookahead offset prefetcher
  4. Runahead metadata
  5. Summary
  References
  About the authors
6. Evaluation of data prefetchers
Mehran Shakerinava, Fatemeh Golshan, Ali Ansari, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad
  1. Introduction
  2. Spatial prefetching
  3. Temporal prefetching
  4. Spatio-temporal prefetching
  5. Offset prefetching
  6. Multi-degree prefetching with pairwise-correlating prefetchers
  7. Summary
  References
  About the authors
Contributors
Ali Ansari
Sharif University of Technology, Tehran, Iran
Fatemeh Golshan
Sharif University of Technology, Tehran, Iran
Pejman Lotfi-Kamran
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Hamid Sarbazi-Azad
Sharif University of Technology, and Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Mehran Shakerinava
Sharif University of Technology, Tehran, Iran
Introduction to data prefetching
For many years, computer designers benefited from Moore’s law to significantly improve the speed of processors. For main memory, in contrast, the focus of improvement has been capacity rather than access time. Hence, there is a large gap between the speed of processors and main memory. To hide the gap, a hierarchy of caches has been used between a processor and main memory. While caches have proved to be quite effective, their effectiveness directly depends on how often a requested piece of data can be found in the cache. Due to the complexity of data access patterns, on many occasions, the requested piece of data cannot be found in the cache hierarchy, which exposes large delays to processors and significantly degrades their performance. Data prefetching is the science and art of predicting, in advance, what pieces of data a processor needs and bringing them into the cache before the processor requests them. In this chapter, we justify the importance of data prefetching, look at a taxonomy of data prefetching, and briefly discuss some simple ways to do prefetching.
1. Introduction
We are living in an era in which Information Technology (IT) is shaping our society. More than at any time in history, our society is dependent on IT for its day-to-day activities. Education, media, science, social networking, etc. are all affected by IT [1]. The steady growth in processor performance is one of the driving forces behind the success and widespread adoption of IT.
Historically, processor performance improved by a factor close to two every 2 years (this phenomenon is sometimes mistakenly referred to as Moore’s law [2]). While processor performance was rapidly improving, the goal of designers was to rapidly increase the capacity of DRAM, which is the building block of main memory. As the speed of DRAM was a second-level optimization for designers, over time, a considerable gap emerged between the speed of processors and main memory, which many refer to as the memory wall.
Since 2004, the rate at which processor performance improves has slowed [3]. Because of this reduction in the annual improvement in processor performance, the memory wall is not widening as it did in the last two decades of the 20th century. Nevertheless, the memory wall is considerable and hurts processor performance.
To hide the delay of the main memory, historically, a cache is used between the processor and the main memory. A cache is a small and fast hardware-managed storage built from SRAM, which is much faster and requires more area per bit than DRAM. As the cache is very small, only part of the data can be stored in the cache at any given point in time. When a processor asks for a piece of data, if it is in the cache (called a hit), it can be delivered to the processor very quickly. However, if the data is not in the cache (called a miss), we have to get the data from elsewhere (e.g., main memory). In such cases, not only is the cache not useful, but it also increases the access latency, as we first need to search the cache and then look for the piece of data elsewhere.
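The cost of hits and misses described above can be illustrated with the standard average-access-time expression. The following is only a sketch: the function name and the latencies (in cycles) are hypothetical values chosen for illustration, not measurements of any processor.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average latency per access: every access pays the cache lookup,
    and the fraction that misses additionally pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical latencies in cycles (illustrative assumptions only).
mostly_hits = average_access_time(hit_time=4, miss_rate=0.10, miss_penalty=200)
always_miss = average_access_time(hit_time=4, miss_rate=1.0, miss_penalty=200)
```

With these assumed numbers, an access stream that always misses pays the memory latency plus the cache lookup, which is the case in which the cache adds latency rather than hiding it.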
To increase the likelihood of finding the data in the cache, contemporary processors benefit from a hierarchy of caches. These processors use multiple levels of caches. The cache that is closest to the processor is the smallest and the fastest. We usually refer to this cache as the L1 cache. As we move further away from the processor, the caches become larger and slower. We refer to these caches as L2, L3, etc. The cache at the end of the cache hierarchy, which is closest to the main memory, is usually referred to as the last-level cache (LLC).
The cache hierarchy is quite effective and serves many of the processor’s memory requests. Nonetheless, the cache hierarchy cannot handle all of the memory requests. For every request that goes to the main memory, the processor is exposed to the whole delay of accessing memory. Given that the delay of main memory is considerable for a processor, a mechanism that lowers the number of requests served by the memory is quite useful.
Data prefetching is a way to improve the effectiveness of caches and lower the stall cycles due to serving processor requests by the main memory. A data prefetcher predicts that a piece of data is going to be used by the processor. Based on this prediction, in case the predicted data is not in the cache, the data prefetcher brings it into the cache. If the prediction is correct, a cache miss is avoided, and the processor finds the data in the cache when it is needed. Otherwise, a useless piece of data is brought into the cache, and possibly a useful piece of data is evicted from the cache to make room for the predicted data.
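To make the two outcomes above concrete, the sketch below counts demand misses in a tiny cache with and without a prefetcher. Everything in it is an assumption made for illustration: the cache is fully associative with LRU replacement, the prefetcher is a simple next-block scheme (a stand-in, not a prefetcher described in this chapter), and prefetched blocks arrive instantly.

```python
from collections import OrderedDict


def count_misses(trace, capacity, prefetch_next_block):
    """Count demand misses for a trace of block numbers."""
    cache = OrderedDict()  # block number -> None, ordered from LRU to MRU

    def install(block):
        # Refresh the block if present; otherwise insert it, evicting the LRU block.
        if block in cache:
            cache.move_to_end(block)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[block] = None

    misses = 0
    for block in trace:
        if block not in cache:
            misses += 1
        install(block)
        if prefetch_next_block:
            install(block + 1)  # prediction: the next block will be needed
    return misses


sequential = list(range(16))     # the prediction is always correct
strided = list(range(0, 32, 2))  # the prediction is always wrong
```

On the sequential trace the prefetcher removes every miss except the first; on the strided trace it removes none and only fills the cache with blocks that are never used.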
The central task of a data prefetcher is predicting the future memory accesses of a processor. Fortunately, the memory accesses of a processor are usually regular and follow patterns (as we see in this book). By learning a pattern and using it for prediction, a data prefetcher can make correct predictions.
There are many patterns that a data prefetcher may learn, and hence, there are many different types of data prefetchers. In this book, we introduce several important types of data prefetchers. We discuss how they perform data prefetching and why they work as intended. We also mention when they are effective and when they are not. In this book, we first discuss the fundamental concepts in data prefetching and then study recent, as well as classic, hardware data prefetchers. We describe the operation of every data prefetcher in detail and shed light on its design trade-offs.
2. Background
In this section, we briefly review some background on hardware data prefetching.
2.1 Predicting memory references
The first step in data prefetching is predicting future memory accesses. Fortunately, data accesses demonstrate several types of correlations and localities, which lead to the formation of patterns among memory accesses, from which data prefetchers can predict future memory references. These patterns emerge from the layout of programs’ data structures in the memory, and the algorithm and the high-level programming constructs that operate on these data structures. In this chapter, we briefly mention three important memory access patterns of applications: (1) stride, (2) temporal, and (3) spatial access patterns.
2.1.1 Stride accesses
A stride access pattern refers to a sequence of memory accesses in which the distance between consecutive accesses is constant, e.g., {A, A+k, A+2k, …} with stride k. Such patterns are frequent in programs with dense matrices and frequently appear when programs operate on multi-dimensional arrays. Please consider the following piece of code.
Pseudocode 1. A simple program that calculates the sum of an array of bytes.
    byte A[100]
    // some computation with A
    ...
    SUM = 0
    for (int i = 0; i < 100; i++)
        SUM += A[i]
In this example, all elements of array A are read to be added to the variable SUM, which holds the aggregate sum of array A. As all elements of an array are placed one after another in memory, if A[0] is placed at address addr, A[1] is placed at addr+1. Similarly, the rest of the elements are placed at addr+2, addr+3, …, addr+99. Due to this particular placement of array A in the memory, and the nature of the algorithm, Pseudocode 1, during execution, generates memory references addr, addr+1, …, and addr+99, and hence exhibits a simple stride access pattern with a stride of 1. If we replace the “for (int i = 0; i < 100; i++)” statement in the code with “for (int i = 0; i < 100; i += 2),” the code references addresses addr, addr+2, addr+4, …, addr+98, and hence exhibits an access pattern with a stride of 2.
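The address sequences discussed for Pseudocode 1 can be reproduced with a short sketch. The helper names and the base address 0x1000 are hypothetical; the sketch assumes one-byte elements, as in the pseudocode.

```python
def array_sum_trace(base, count, step=1, element_size=1):
    """Addresses read by a loop `for (i = 0; i < count; i += step) SUM += A[i]`."""
    return [base + i * element_size for i in range(0, count, step)]


def constant_stride(addresses):
    """Return the stride if all consecutive distances are equal, else None."""
    deltas = {later - earlier for earlier, later in zip(addresses, addresses[1:])}
    return deltas.pop() if len(deltas) == 1 else None


addr = 0x1000                                    # assumed placement of A[0]
stride_one = array_sum_trace(addr, 100)          # addr, addr+1, ..., addr+99
stride_two = array_sum_trace(addr, 100, step=2)  # addr, addr+2, ..., addr+98
```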
While stride accesses are abundant in array and matrix data structures, they are not unique to such data structures. Stride accesses also appear in pointer-based data structures when memory allocators arrange the objects sequentially and in a constant-size manner in the memory [1].
2.1.2 Temporal address correlation
Temporal address correlation [2] refers to a sequence of addresses that tend to be accessed together and in the same order. For example, if we observe {A, B, C, D}, then it is likely for {B, C, D} to follow {A} in the future. Temporal address correlation stems fundamentally from the fact that programs consist of loops, and is observed when data structures such as lists, arrays, and linked lists are traversed. When data structures are stable [4], access patterns recur, and the temporal address correlation is manifested [2]. Please consider the following piece of code.
Pseudocode 2. A simple program that calculates the sum of a linked list of bytes.
    struct ELEMENT_T { byte B; pointer P };
    ELEMENT_T *e;
    // create and manipulate a linked list of ELEMENT_T that e points to
    ...
    SUM = 0
    ELEMENT_T *p = e;
    while (p != NULL) {
        SUM += p->B;
        p = p->P;
    }
The functionality of Pseudocode 2 is similar to Pseudocode 1. The major difference is that the array in Pseudocode 1 is replaced with a linked list. As every element of a linked list can be anywhere in the memory, there is usually no manifestation of a stride access pattern when the code is being executed. However, if we traverse the linked list multiple times, e.g., the code makes some modifications to the linked list and then attempts to determine the total sum again, the sequence of memory accesses due to the linked-list traversal will be the same as the previous time except for differences due to the addition or deletion of certain elements in the linked list. If the change in the linked list is not significant, most of the sequence of memory references is identical across multiple traversals of the linked list, which leads to the manifestation of a temporal access pattern.
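The recurrence of an address sequence across traversals can be sketched with a table that records, for each address, the address that followed it. This is a deliberately minimal illustration under stated assumptions: the node addresses are made up, the table is unbounded, and the pairwise successor scheme is a simplification rather than any specific prefetcher covered in this book.

```python
def learn_successors(miss_sequence):
    """Map each address to the address that most recently followed it."""
    table = {}
    for current, following in zip(miss_sequence, miss_sequence[1:]):
        table[current] = following
    return table


def replay(table, start, length):
    """Follow recorded successors from `start`, giving up to `length` predictions."""
    predictions = []
    address = start
    for _ in range(length):
        if address not in table:
            break
        address = table[address]
        predictions.append(address)
    return predictions


# Hypothetical addresses of list nodes scattered in memory (no stride among them).
first_traversal = [0x7A40, 0x1C00, 0x93F0, 0x2280]
table = learn_successors(first_traversal)
predicted = replay(table, start=0x7A40, length=3)
```

When the second traversal touches the first node, the recorded history yields the remaining nodes in order, which is the {A} followed by {B, C, D} behavior described above.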
2.1.3 Spatial address correlation
Spatial address correlation [3] refers to the phenomenon that similar access patterns occur in different regions of memory. For example, if a program visits locations {A, B, C, D} of page X, it is probable that it visits locations {A, B, C, D} of other pages as well. Spatial correlation transpires because applications use various objects with a regular and fixed layout, and accesses reappear while traversing data structures [3]. To better understand the concept, please consider the following piece of code.
Pseudocode 3. A simple program that sums the three scalar fields of every element of a linked list.
    struct ELEMENT_T { byte a; byte A1[100]; byte b; byte A2[20]; byte c; pointer p; }
    ELEMENT_T *e;
    // create and manipulate a linked list of ELEMENT_T that e points to
    ...
    ELEMENT_T *p = e;
    while (p != NULL) {
        SUM += p->c;
        SUM += p->b;
        SUM += p->a;
        p = p->p;
    }
The code is similar to Pseudocode 2. The major difference is that each element of the linked list is a large structure consisting of two arrays and three scalar variables in addition to the pointer. The piece of code, however, only cares about the three scalar variables and attempts to calculate the sum of all the scalar variables in the linked list. As the data structure is a linked list and hence all of its elements may be in different locations in the main memory, the code likely does not exhibit stride access patterns. Moreover, unlike the last example, assume that we plan to traverse the linked list only once. As such, the code does not exhibit temporal access patterns either. Nonetheless, while each element of the linked list might be in any location in the main memory, once the code touches variable c, it also touches variables b and a. Moreover, as these variables are located in the main memory with a fixed layout dictated by the ELEMENT_T structure, when the first element is touched, the location of the other two can be inferred. This access pattern is called a spatial access pattern.
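The inference described above can be sketched by recording offsets relative to the first field touched and reapplying them to another element. The sketch assumes the ELEMENT_T fields are packed byte by byte with no padding (so a, b, and c sit at offsets 0, 101, and 122) and uses made-up element addresses; the function names are hypothetical.

```python
# Field offsets under the assumed padding-free layout of ELEMENT_T.
OFFSET_A, OFFSET_B, OFFSET_C = 0, 101, 122


def element_accesses(base):
    """Addresses touched for one element, in program order: c, then b, then a."""
    return [base + OFFSET_C, base + OFFSET_B, base + OFFSET_A]


def learn_offsets(accesses):
    """Offsets of the later accesses relative to the first (trigger) access."""
    trigger = accesses[0]
    return [address - trigger for address in accesses[1:]]


def predict_from_trigger(trigger, offsets):
    """Apply a learned offset pattern to a trigger access in another region."""
    return [trigger + offset for offset in offsets]


pattern = learn_offsets(element_accesses(0x5000))           # learned on one element
second_trigger = 0x9000 + OFFSET_C                          # c of another element
predicted = predict_from_trigger(second_trigger, pattern)   # its b and a
```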
2.2 Prefetching lookahead
Prefetchers need to issue timely prefetch requests for the predicted addresses. Preferably, a prefetcher sends prefetch requests well in advance and supplies enough storage for the prefetched blocks to hide the entire latency of memory accesses. A prefetch request that is issued too early may evict a useful block from the cache, and a late prefetch may reduce the effectiveness of prefetching because a portion of the long memory-access latency is still exposed to the processor.
Prefetching lookahead refers to how far ahead of the demand miss stream the prefetcher can send requests. An aggressive prefetcher may offer a high prefetching lookahead (say, 8) and issue many prefetch requests ahead of the processor’s demand requests to hide the entire latency of memory accesses; on the other hand, a conservative prefetcher may offer a low prefetching lookahead and send a single prefetch request in advance of the processor’s demand to avoid wasting resources (e.g., cache storage and memory bandwidth). Typically, there is a trade-off between the aggressiveness of a prefetching technique and its accuracy: making a prefetcher more aggressive usually leads to covering more data-miss-induced stall cycles but at the cost of fetching more useless data.
Some pieces of prior work propose to dynamically adjust the prefetching lookahead [5–7]. Based on the observation that the optimal prefetching degree differs across applications, and across execution phases of a particular application as well, these approaches employ heuristics to increase or decrease the prefetching lookahead. For example, SPP [6] monitors the accuracy of issued prefetch requests and reduces the prefetching lookahead if the accuracy becomes smaller than a predefined threshold.
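An accuracy-driven adjustment of the kind described above might be sketched as follows. This is a generic feedback heuristic under assumed parameters, not the mechanism of SPP or of any cited work: the thresholds, the step of one, and the bounds of 1 and 8 are all hypothetical.

```python
def adjust_lookahead(lookahead, useful, issued,
                     low=0.40, high=0.75, minimum=1, maximum=8):
    """Lower the lookahead when accuracy is poor, raise it when accuracy is good."""
    if issued == 0:
        return lookahead              # nothing observed in this interval
    accuracy = useful / issued        # fraction of prefetches that were used
    if accuracy < low:
        return max(minimum, lookahead - 1)
    if accuracy > high:
        return min(maximum, lookahead + 1)
    return lookahead


after_poor_interval = adjust_lookahead(4, useful=10, issued=100)
after_good_interval = adjust_lookahead(4, useful=90, issued=100)
```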
2.3 Location of data prefetcher
Prefetching can be employed to move the data from lower levels of the memory hierarchy to any higher level. Prior work used data prefetchers at all cache levels, from the primary data cache to the shared last-level cache.
The location of a data prefetcher has a profound impact on its overall behavior [8]. A prefetcher in the first-level cache can observe all memory accesses, and hence, can issue highly accurate prefetch requests, but at the cost of imposing large storage overhead for recording the metadata information. In contrast, a prefetcher in the last-level cache observes the access sequences that have been filtered at higher levels of the memory hierarchy, resulting in lower prediction accuracy, but higher storage efficiency.
2.4 Prefetching hazards
A naive deployment of a data prefetcher may not only fail to improve system performance but may also significantly harm performance and energy efficiency [9]. The two well-known major drawbacks of data prefetching are (1) cache pollution and (2) off-chip bandwidth overhead.
2.4.1 Cache pollution
Data prefetching may increase the demand misses by replacing useful cache blocks with useless prefetched data, harming the performance. Cache pollution usually occurs when an aggressive prefetcher exhibits low accuracy and/or when prefetch requests of a core in a many-core processor compete for shared resources with demand accesses of other cores [10].
2.4.2 Bandwidth overhead
In a many-core processor, prefetch requests of a core can delay demand requests of another core because they contend for memory bandwidth [10]. This interference is the major obstacle to using data prefetchers in many-core processors, and the problem gets thornier as the number of cores increases [11,12].
2.4.3 Placing prefetched data
Data prefetchers usually place the prefetched data into one of the following two structures: (1) the cache itself, and (2) an auxiliary buffer next to the cache. In case an auxiliary buffer is used for the prefetched data, demand requests first look for the data in the cache; if the data is not found, the auxiliary buffer is searched before sending a request to the lower levels of the memory hierarchy.
Storing the prefetched data into the cache lowers the latency of accessing data when the prediction is correct. However, when the prediction is incorrect or when the prefetch request is not timely (i.e., too early), having the prefetched data in the cache may result in evicting useful cache blocks.
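The lookup order for the auxiliary-buffer organization can be sketched as below. The sketch makes several assumptions not stated in the text: both structures are modeled as unbounded sets of block addresses, a block found in the buffer is moved into the cache on use, and a missing block is installed in the cache immediately.

```python
def demand_lookup(address, cache, prefetch_buffer):
    """Search the cache first, then the auxiliary buffer, then go to lower levels."""
    if address in cache:
        return "cache hit"
    if address in prefetch_buffer:
        prefetch_buffer.remove(address)  # assumed policy: promote into the cache on use
        cache.add(address)
        return "buffer hit"
    cache.add(address)                   # fetched from a lower level of the hierarchy
    return "miss"


cache = {0x100}
prefetch_buffer = {0x140}
results = [demand_lookup(a, cache, prefetch_buffer)
           for a in (0x100, 0x140, 0x180, 0x140)]
```

Keeping prefetched blocks in the buffer until they are used means that an incorrect prediction occupies only a buffer entry and does not evict a block from the cache.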
2.5 Prefetcher types
There are several types of data prefetchers. At a very high level, data prefetchers can be classified into hardware prefetchers and nonhardware prefetchers. A hardware prefetcher is a data prefetching technique that is implemented as a hardware component in a processor. Any other prefetching technique is a nonhardware prefetcher. Fig. 1 shows a classification of data prefetching techniques.
We focus on hardware data prefetching techniques in this book. As shown in Fig. 1, hardware data prefetchers can be classified into spatial, temporal, and non-spatial-temporal prefetchers. We cover conventional spatial prefetchers in chapter “Spatial prefetching” by Lotfi-Kamran and Sarbazi-Azad and state-of-the-art spatial prefetchers in chapter “State-of-the-art data prefetchers” by Shakerinava et al. Conventional and state-of-the-art temporal prefetchers are covered in chapters “Temporal prefetching” by Lotfi-Kamran and Sarbazi-Azad and “State-of-the-art data prefetchers” by Shakerinava et al., respectively. Non-spatial-temporal prefetchers are covered in chapter “Beyond spatial or temporal prefetching” by Lotfi-Kamran and Sarbazi-Azad. We evaluate various types of hardware data prefetchers in chapter “Evaluation of data prefetchers” by Shakerinava et al. to empirically assess their strengths and weaknesses.
Fig. 1 A classification of various data prefetching techniques (with an emphasis on hardware data prefetchers).
Fig. 1 also shows a classification of the nonhardware data prefetchers. As this book is about hardware data prefetching, we do not cover nonhardware data prefetching techniques. However, Section 4 of this chapter reviews various nonhardware data prefetching techniques and offers a short explanation of them.
3. A preliminary hardware data prefetcher
To give insight into how a typical prefetcher operates, we now describe a preliminary-yet-prevalent type of stride prefetching. Stride prefetchers are widely used in commercial processors (e.g., IBM Power4 [13], Intel Core [14], AMD Opteron [15], Sun UltraSPARC III [16]) and have been shown to be quite effective for desktop and engineering applications. Stride prefetchers [4,17–23] detect streams (i.e., sequences of consecutive addresses) that exhibit stride access patterns and generate prefetch requests by adding the detected stride to the last observed address.
Instruction-Based Stride Prefetcher (IBSP) [17] is a preliminary type of stride prefetching. The prefetcher tracks the stride streams on a per-load-instruction basis: the prefetcher observes accesses issued by individual load instructions and sends prefetch requests if the accesses manifest a stride pattern. Fig. 2 shows the organization of IBSP’s metadata table, named Reference Prediction Table (RPT). RPT is a structure tagged and indexed with the program counter (PC) of load instructions. Each entry in the RPT corresponds to a specific load instruction; it keeps the Last Block referenced by the instruction and the Last Stride observed in the stream (i.e., the distance between the last two addresses accessed by the instruction).
Fig. 2 The organization of Instruction-Based Stride Prefetcher (IBSP). The ‘RPT’ keeps track of various streams.
Upon each trigger access (i.e., a cache miss or a prefetch hit), the RPT is searched with the PC of the instruction. If the search results in a miss, no history exists for the instruction, and hence, no prefetch request can be issued. A search may result in a miss under two circumstances: (1) when a load instruction is new in the execution flow of the program, so no history has been recorded for it so far, and (2) when a load instruction is re-executed after a long time, and the corresponding recorded metadata has been evicted from the RPT due to conflicts. In such cases, when no matching entry exists in the RPT, a new entry is allocated for the instruction, and possibly a victim entry is evicted. The new entry is tagged with the PC, and the Last Block field of the entry is filled with the referenced address. The Last Stride is also set to zero (an invalid value) as no stride has yet been observed for this stream.
However, if searching the RPT results in a hit, there is a recorded history for the instruction. In this case, the recorded history is checked against the current access to find out whether or not the stream is a stride one. To do so, the difference between the current address and the Last Block is calculated to get the current stride. Then, the current stride is checked against the recorded Last Stride. If they do not match, the stream does not exhibit a stride access pattern. However, if they match, the stream is taken to be a stride one, as three consecutive accesses have produced two identical strides. In this case, based on the lookahead of the prefetcher (Section 2.2), several prefetch requests can be issued by consecutively adding the observed stride to the requested address. For example, if the current address and stride are A and k, respectively, and the lookahead of prefetching is three, prefetch candidates will be {A+k, A+2k, A+3k}. Finally, regardless of whether the stream is a stride one or not, the corresponding RPT entry is updated: the Last Block is updated with the current address, and the Last Stride takes the value of the current stride.
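The RPT operation described above can be summarized in a short sketch. It follows the lookup, allocate, compare, and update steps of the text, but simplifies the hardware: the table is modeled as an unbounded dictionary keyed by PC (so conflict evictions never occur), addresses are treated as block numbers, and the class and method names are hypothetical.

```python
class ReferencePredictionTable:
    """Sketch of IBSP's metadata table: PC -> (Last Block, Last Stride)."""

    def __init__(self, lookahead=3):
        self.lookahead = lookahead
        self.entries = {}  # assumption: unbounded, so no victim is ever evicted

    def access(self, pc, address):
        """Handle one trigger access and return the list of prefetch candidates."""
        entry = self.entries.get(pc)
        if entry is None:
            # RPT miss: allocate an entry; a Last Stride of zero marks "invalid".
            self.entries[pc] = (address, 0)
            return []
        last_block, last_stride = entry
        stride = address - last_block
        candidates = []
        if stride != 0 and stride == last_stride:
            # Two identical strides in a row: prefetch A+k, A+2k, ...
            candidates = [address + i * stride
                          for i in range(1, self.lookahead + 1)]
        # The entry is updated whether or not the stream is a stride one.
        self.entries[pc] = (address, stride)
        return candidates


rpt = ReferencePredictionTable(lookahead=3)
first = rpt.access(pc=0x400, address=100)   # allocates an entry
second = rpt.access(pc=0x400, address=108)  # first stride observed (8)
third = rpt.access(pc=0x400, address=116)   # stride confirmed
```

Only the third access by the same load instruction produces candidates, matching the observation that three consecutive accesses are needed to see two identical strides.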
4. Nonhardware data prefetching
Progress in fabrication technology, accompanied by circuit-level and microarchitectural advancements, has brought about significant enhancements in processors’ performance over the past decades. Meanwhile, the performance of memory systems has not kept pace with that of the processors, forming a large gap between the performance of processors and memory systems [13–17]. As a consequence, numerous approaches have been proposed to enhance the execution performance of applications by bridging the processor-memory performance gap. Hardware data prefetching is just one of these approaches. Hardware data prefetching bridges the gap by proactively fetching the data ahead of the cores’ requests to eliminate the idle cycles in which the processor is waiting for the response of the memory system. In this section, we briefly review the other approaches that target the same goal (i.e., bridging the processor-memory performance gap) but in other ways.
Multithreading [18] enables the processor to better utilize its computational resources, as stalls in one thread can be overlapped with the execution of other threads [4,19,20]. Multithreading, however, only improves throughput and does nothing for (or even worsens) the response time [15,21–23], which is crucial for satisfying the strict latency requirements of server applications.
Thread-Based Prefetching techniques [24–28] exploit idle thread contexts or distinct pre-execution hardware to drive helper threads that try to overlap the cache misses with speculative execution. Such helper threads, formed either by the hardware or by the compiler, execute a piece of code that prefetches for the main thread. Nonetheless, the additional threads and fetch/execution bandwidth may not be available when the processor is fully utilized. The abundant request-level parallelism of server applications [22–24] makes such schemes ineffective in that the helper threads need to compete with the main threads for the hardware contexts.
Runahead Execution [29–31] lets the execution resources of a core that would otherwise be stalled on an off-chip cache miss run ahead of the stalled execution in an attempt to discover additional load misses. Similarly, Branch Prediction Directed Prefetching [5] utilizes the branch predictor to run in advance of the executing program, thereby prefetching for load instructions along the expected future path. Such approaches, nevertheless, are constrained by the accuracy of the branch predictor and can cover only a portion of the miss latency, since the runahead thread/branch predictor may not be capable of executing far enough ahead to completely hide a cache miss. Moreover, these approaches can only prefetch independent cache misses [32] and may not be effective for many of the server workloads, e.g., OLTP and Web applications, that are characterized by long chains of dependent memory accesses [2,33].
On the software side, there are efforts to restructure programs to boost chip-level Data Sharing and Data Reuse [34,35] in order to decrease off-chip accesses. While these techniques are useful for workloads with modest datasets, they fall short for big-data server workloads, where the multigigabyte working sets of workloads dwarf the few megabytes of on-chip cache capacity. The ever-growing datasets of server workloads make such approaches unscalable. Software Prefetching techniques [36–41] profile the program code and insert prefetch instructions to eliminate cache misses. While these techniques are shown to be beneficial for small benchmarks, they usually require significant programmer effort to produce optimized code that generates timely prefetch requests.
Memory-Side Prefetching techniques [42–44] place the hardware for data prefetching near DRAM, for the sake of saving precious SRAM budget. In such approaches (e.g., [43]), prefetching is performed by a user thread running near the DRAM, and prefetched pieces of data are sent to the on-chip caches. Unfortunately, such techniques lose the predictability of core requests [45] and are incapable of performing cache-level optimizations (e.g., avoiding cache pollution [7]).
5. Conclusion
Data cache misses are a major source of performance degradation in applications. Data prefetching is a widely used technique for reducing the number of data cache misses or their negative effects. Data prefetchers usually benefit from correlations and localities among data accesses to predict future memory references. As there exist several types of correlations among data accesses, there are several types of data prefetchers. In this book, we introduce several important classes of data prefetchers and highlight their strengths and weaknesses.
References
[1] K.R. Lee, Impacts of Information Technology on Society in the New Century, 2001. https://www.zurich.ibm.com/pdf/news/Konsbruck.pdf.
[2] G.E. Moore, Cramming more components onto integrated circuits, Electronics 38 (8) (1965) 114–117.
[3] D. Geer, Chip makers turn to multicore processors, Computer 38 (5) (2005) 11–13.
[4] J.-L. Baer, T.-F. Chen, An effective on-chip preloading scheme to reduce data access penalty, in: Proceedings of the ACM/IEEE Conference on Supercomputing, 1991.
[5] F. Dahlgren, P. Stenstrom, Hardware-based stride and sequential prefetching in shared-memory multiprocessors, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 1995.
[6] P. Lotfi-Kamran, H. Sarbazi-Azad, M. Bakhshalipour, Domino temporal data prefetcher, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2018.
[7] M. Shakerinava, P. Lotfi-Kamran, H. Sarbazi-Azad, M. Bakhshalipour, Bingo spatial data prefetcher, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2019.
[8] J. Kim, P. Sharma, R. Panda, P. Gratz, D. Jimenez, D. Kadjo, B-Fetch: branch prediction directed prefetching for chip-multiprocessors, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2014.
[9] Z. Fang, A. Zhai, P. Yew, S. Mehta, Multi-stage coordinated prefetching for present-day processors, in: Proceedings of the International Conference on Supercomputing (ICS), 2014.
[10] O. Mutlu, H. Kim, Y.N. Patt, S. Srinath, Feedback directed prefetching: improving the performance and bandwidth-efficiency of hardware prefetchers, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2007.
[11] S.H. Pugsley, P.V. Gratz, A.L. Narasimha Reddy, C. Wilkerson, Z. Chishti, J. Kim, Path confidence based lookahead prefetching, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2016.
[12] P. Lotfi-Kamran, A. Mazloumi, F. Samandi, M. Naderan-Tahan, M. Modarressi, H. Sarbazi-Azad, M. Bakhshalipour, Fast data delivery for many-core processors, IEEE Trans. Comput. 67 (10) (2018) 1416–1429.
[13] O. Mutlu, C.J. Lee, Y.N. Patt, E. Ebrahimi, Coordinated control of multiple prefetchers in multi-core systems, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2009.
[14] C.J. Lee, O. Mutlu, Y.N. Patt, E. Ebrahimi, Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems, in: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010.
[15] D. Han, O. Mutlu, M. Harchol-Balter, Y. Kim, ATLAS: a scalable and high-performance scheduling algorithm for multiple memory controllers, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2010.
[16] J.S. Dodson, J. Fields, H.Q. Le, B. Sinharoy, J.M. Tendler, POWER4 system microarchitecture, IBM J. Res. Develop. 46 (1) (2002) 5–25.