Chapter 1 Solutions
Case Study 1: Chip Fabrication Cost
1.1 a. Yield = 1/(1 + (0.04 × 2))^14 = 0.34
b. It is fabricated in a larger technology, which is an older plant. As plants age, their process gets tuned, and the defect rate decreases.
1.2 a. Phoenix: Dies per wafer = 724
Yield = 1/(1 + (0.04 × 2))^14 = 0.340
Profit = 724 × 0.34 × $30 = $7384.80
b. Red Dragon: Dies per wafer = 1234
c. Phoenix chips: 25,000/724 = 34.5 wafers needed
Red Dragon chips: 50,000/1234 = 40.5 wafers needed
Therefore, the most lucrative split is 40 Red Dragon wafers, 30 Phoenix wafers.
1.3 a. Defect-free single core: Yield = 1/(1 + (0.04 × 0.25))^14 = 0.87
Equation for the probability that N cores are defect free on a chip: (# combinations) × (0.87)^N × (1 - 0.87)^(8 - N)
Yield for Phoenix4: (0.39 + 0.21 + 0.06 + 0.01) = 0.67
Yield for Phoenix2: (0.001 + 0.0001) = 0.0011
Yield for Phoenix1: 0.000004
b. It would be worthwhile to sell Phoenix4. However, the other two have such a low probability of occurring that it is not worth selling them.
Computer Architecture: A Quantitative Approach (6th Edition) Solutions Manual (John L. Hennessy, David A. Patterson)
c. $20 = Wafer size/(old dpw × 0.28)
Step 1: Determine how many Phoenix4 chips are produced for every Phoenix8 chip.
There are 67/33 Phoenix4 chips for every Phoenix8 chip = 2.03
$30 + 2.03 × $25 = $80.76
Case Study 2: Power Consumption in Computer Systems
1.4 a. Energy: 1/8. Power: unchanged.
b. Energy: Energy_new/Energy_old = (Voltage × 1/8)^2/Voltage^2 = 0.0156
Power: Power_new/Power_old = 0.0156 × (Frequency × 1/8)/Frequency = 0.00195
c. Energy: Energy_new/Energy_old = (Voltage × 0.5)^2/Voltage^2 = 0.25
Power: Power_new/Power_old = 0.25 × (Frequency × 1/8)/Frequency = 0.0313
d. 1 core = 25% of the original power, running for 25% of the time:
0.25 × 0.25 + (0.25 × 0.2) × 0.75 = 0.0625 + 0.0375 = 0.1
1.5 a. Amdahl's law: 1/(0.8/4 + 0.2) = 1/(0.2 + 0.2) = 1/0.4 = 2.5
b. 4 cores, each at 1/2.5 the frequency and voltage
Energy: Energy_quad/Energy_single = 4 × (Voltage × 1/2.5)^2/Voltage^2 = 0.64
Power: Power_new/Power_old = 0.64 × (Frequency × 1/2.5)/Frequency = 0.256
c. 2 cores + 2 ASICs vs. 4 cores
1.6 a. Workload A speedup: 225,000/13,461 = 16.7
Workload B speedup: 280,000/36,465 = 7.7
Overall speedup: 1/(0.7/16.7 + 0.3/7.7) ≈ 12.4
b. General-purpose: 0.70 × 0.42 + 0.30 = 0.594
GPU: 0.70 × 0.37 + 0.30 = 0.559
TPU: 0.70 × 0.80 + 0.30 = 0.86
c. General-purpose: 159 W + (455 W - 159 W) × 0.594 = 335 W
GPU: 357 W + (991 W - 357 W) × 0.559 = 711 W
TPU: 290 W + (384 W - 290 W) × 0.86 = 371 W
d.
GPU: 1/(0.4/2.46 + 0.1/2.76 + 0.5/1.25) = 1.67
TPU: 1/(0.4/41 + 0.1/21.2 + 0.5/0.17) = 0.33
e. General-purpose: 14,000/504 = 27.8 → 28
GPU: 14,000/1838 = 7.62 → 8
TPU: 14,000/861 = 16.3 → 17
f. General-purpose: 2200/504 = 4.37 → 4; 14,000/(4 × 504) = 6.94 → 7
GPU: 2200/1838 = 1.2 → 1; 14,000/(1 × 1838) = 7.62 → 8
TPU: 2200/861 = 2.56 → 2; 14,000/(2 × 861) = 8.13 → 9
Exercises
1.7 a. Somewhere between 1.4^10 and 1.55^10, or 28.9× to 80×
b. 6043 in 2003; a 52% growth rate per year for 12 years gives 60,500,000 (rounded)
c. 24,129 in 2010; a 22% growth rate per year for 15 years gives 1,920,000 (rounded)
d. Multiple cores on a chip rather than faster single-core performance
e. 2 = x^22, x = 1.032, giving 3.2% growth
1.8 a. 50%
b. Energy: Energy_new/Energy_old = (Voltage × 1/2)^2/Voltage^2 = 0.25
1.9 a. 60%
b. 0.4 + 0.6 × 0.3 = 0.58, which reduces the energy to 58% of the original energy
c. new Power/old Power = [Capacitance × (Voltage × 0.8)^2 × (Frequency × 0.4)]/[Capacitance × Voltage^2 × Frequency] = 0.8^2 × 0.4 = 0.256 of the original power.
d. 0.4 + 0.3 × 0.2 = 0.46, which reduces the energy to 46% of the original energy
1.10 a. 10^9/100 = 10^7
b. 10^7/(10^7 + 24) ≈ 1
c. [need solution]
1.11 a. 35/10,000 × 3333 = 11.67 days
b. There are several correct answers. One would be that, with the current system, one computer fails approximately every 5 min. 5 min is unlikely to be enough time to isolate the computer, swap it out, and get the computer back online again. 10 min, however, is much more likely. In any case, it would greatly extend the amount of time before 1/3 of the computers have failed at once. Because the cost of downtime is so huge, being able to extend this is very valuable.
c. $90,000 = (x + x + x + 2x)/4
$360,000 = 5x
$72,000 = x; 4th quarter = 2x = $144,000/h
Figure S.1 Plot of the equation: y = 100/((100 - x) + x/10).
1.12 a. See Figure S.1.
b. 2 = 1/((1 - x) + x/20); x = 10/19 = 52.6%
c. (0.526/20)/(0.474 + 0.526/20) = 5.3%
d. Extra speedup with 2 units: 1/(0.1 + 0.9/2) = 1.82. 1.82 × 20 = 36.4. Total speedup: 1.95. Extra speedup with 4 units: 1/(0.1 + 0.9/4) = 3.08. 3.08 × 20 = 61.5. Total speedup: 1.97.
1.13 a. old execution time = 0.5 new + 0.5 × 10 new = 5.5 new
b. In the original code, the unenhanced part is equal in time to the enhanced part (sped up by 10), therefore:
(1 - x) = x/10
10 - 10x = x
10 = 11x
10/11 = x = 0.91
1.14 a. 1/(0.8 + 0.20/2) = 1.11
b. 1/(0.7 + 0.20/2 + 0.10 × 3/2) = 1.05
c. fp ops: 0.1/0.95 = 10.5%, cache: 0.15/0.95 = 15.8%
1.15 a. 1/(0.5 + 0.5/22) = 1.91
b. 1/(0.1 + 0.90/22) = 7.10
c. 41% × 22 = 9, so A runs on 9 cores. Speedup of A on 9 cores: 1/(0.5 + 0.5/9) = 1.8. Overall speedup if 9 cores have a 1.8 speedup and the others none: 1/(0.6 + 0.4/1.8) = 1.22
d. Calculate values for all processors as in c. Obtain: 1.8, 3, 1.82, 2.5, respectively.
e. 1/(0.41/1.8 + 0.27/3 + 0.18/1.82 + 0.14/2.5) = 2.12
1.16 a. 1/(0.2 + 0.8/N)
b. 1/(0.2 + 8 × 0.005 + 0.8/8) = 2.94
c. 1/(0.2 + 3 × 0.005 + 0.8/8) = 3.17
d. 1/(0.2 + log2(N) × 0.005 + 0.8/N)
e. d/dN (1/((1 - P) + log2(N) × 0.005 + P/N)) = 0
Chapter 2 Solutions
Case Study 1: Optimizing Cache Performance via Advanced Techniques
2.1 a. Each element is 8 B. Because a 64 B cache line has 8 elements, and each column access will result in fetching a new line for the nonideal matrix, we need a minimum of 8 × 8 (64 elements) for each matrix. Hence, the minimum cache size is 128 × 8 B = 1 KB.
b. The blocked version only has to fetch each input and output element once. The unblocked version will have one cache miss for every 64 B/8 B = 8 row elements. Each column requires 64 B × 256 of storage, or 16 KB. Thus, column elements will be replaced in the cache before they can be used again. Hence, the unblocked version will have 9 misses (1 row and 8 columns) for every 2 in the blocked version.
c. for (i = 0; i < 256; i = i + B) {
       for (j = 0; j < 256; j = j + B) {
           for (m = 0; m < B; m++) {
               for (n = 0; n < B; n++) {
                   output[j + n][i + m] = input[i + m][j + n];
               }
           }
       }
   }
d. 2-way set associative. In a direct-mapped cache, the blocks could be allocated so that they map to overlapping regions in the cache.
e. You should be able to determine the level-1 cache size by varying the block size. The ratio of the blocked and unblocked program speeds for arrays that do not fit in the cache in comparison to blocks that do is a function of the cache block size, whether the machine has out-of-order issue, and the bandwidth provided by the level-2 cache. You may have discrepancies if your machine has a write-through level-1 cache and the write buffer becomes a limiter of performance.
2.2 Because the unblocked version is too large to fit in the cache, processing eight 8 B elements requires fetching one 64 B row cache block and 8 column cache blocks. Because each iteration requires 2 cycles without misses, prefetches can be initiated every 2 cycles, and the number of prefetches per iteration is more than one, the memory system will be completely saturated with prefetches. Because the latency of a prefetch is 16 cycles, and one will start every 2 cycles, 16/2 = 8 will be outstanding at a time.
2.3 Open hands-on exercise, no fixed solution
Case Study 2: Putting It All Together: Highly Parallel Memory Systems
2.4 a. The second-level cache is 1 MB and has a 128 B block size.
b. The miss penalty of the second-level cache is approximately 105 ns.
c. The second-level cache is 8-way set associative.
d. The main memory is 512 MB.
e. Walking through pages with a 16 B stride takes 946 ns per reference. With 250 such references per page, this works out to approximately 240 μs per page.
2.5 a. Hint: This is visible in the graph above shown as a slight increase in L2 miss service time for large data sets, and is 4 KB for the graph above.
b. Hint: Take independent strides by the page size and look for increases in latency not attributable to cache sizes. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.
c. Hint: This is visible in the graph above shown as a slight increase in L2 miss service time for large data sets, and is 15 ns in the graph above.
d. Hint: Take independent strides that are multiples of the page size to see if the TLB is fully associative or set associative. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.
2.6 a. Hint: Look at the speed of programs that easily fit in the top-level cache as a function of the number of threads.
b. Hint: Compare the performance of independent references as a function of their placement in memory.
2.7 Open hands-on exercise, no fixed solution
Case Study 3: Studying the Impact of Various Memory System Organizations
2.8 On a row buffer miss, the time taken to retrieve a 64-byte block of data equals tRP + tRCD + CL + transfer time = 13 + 13 + 13 + 4 = 43 ns.
2.9 On a row buffer hit, the time taken to retrieve a 64-byte block of data equals CL + transfer time = 13 + 4 = 17 ns.
2.10 Each row buffer miss involves steps (i)-(iii) in the problem description. Before the Precharge of a new read operation can begin, the Precharge, Activate, and CAS latencies of the previous read operation must elapse. In other words, back-to-back read operations are separated by time tRP + tRCD + CL = 39 ns. Of this time, the memory channel is only occupied for 4 ns. Therefore, the channel utilization is 4/39 = 10.3%.
2.11 When a read operation is initiated in a bank, 39 ns later, the channel is busy for 4 ns. To keep the channel busy all the time, the memory controller should initiate read operations in a bank every 4 ns. Because successive read operations to a single bank must be separated by 39 ns, over that period, the memory controller should initiate read operations to unique banks. Therefore, at least 10 banks
are required if the memory controller hopes to achieve 100% channel utilization. It can then initiate a new read operation to each of these 10 banks every 4 ns, and then repeat the process. Another way to arrive at this answer is to divide 100 by the 10.3% utilization calculated previously.
2.12 We can assume that accesses to a bank alternate between row buffer hits and misses. The sequence of operations is as follows: Precharge begins at time 0; Activate is performed at time 13 ns; CAS is performed at time 26 ns; a second CAS (the row buffer hit) is performed at time 30 ns; a Precharge is performed at time 43 ns; and so on. In the above sequence, we can initiate a row buffer miss and a row buffer hit to a bank every 43 ns. So, the channel is busy for 8 out of every 43 ns, that is, a utilization of 18.6%. We would therefore need at least 6 banks to achieve 100% channel utilization.
2.13 With a single bank, the first request would be serviced after 43 ns. The second would be serviced 39 ns later, that is, at time 82 ns. The third and fourth would be serviced at times 121 and 160 ns. This gives us an average memory latency of (43 + 82 + 121 + 160)/4 = 101.5 ns. Note that waiting in the queue is a significant component of this average latency. If we had four banks, the first request would be serviced after 43 ns. Assuming that the four requests go to four different banks, the second, third, and fourth requests are serviced at times 47, 51, and 55 ns. This gives us an average memory latency of 49 ns. The average latency would be higher if two of the requests went to the same bank. Note that by offering even more banks, we would increase the probability that the four requests go to different banks.
2.14 As seen in the previous questions, by growing the number of banks, we can support higher channel utilization, thus offering higher bandwidth to an application. We have also seen that by having more banks, we can support higher parallelism and therefore lower queuing delays and lower average memory latency. Thus, even though the latency for a single access has not changed because of limitations from physics, by boosting parallelism with banks, we have been able to improve both average memory latency and bandwidth. This argues for a memory chip that is partitioned into as many banks as we can manage. However, each bank introduces overheads in terms of peripheral logic near the bank. To balance performance and density (cost), DRAM manufacturers have settled on a modest number of banks per chip: 8 for DDR3 and 16 for DDR4.
2.15 When channel utilization is halved, the power reduces from 535 to 295 mW, a 45% reduction. At 70% channel utilization, increasing the row buffer hit rate from 50% to 80% moves power from 535 to 373 mW, a 30% reduction.
2.16 The table is as follows (assuming the default system config):
Note that the 4 Gb DRAM chips are manufactured with a better technology, so it is better to use fewer and larger chips to construct a rank with a given capacity, if we are trying to reduce power.
Exercises
2.17 a. The access time of the direct-mapped cache is 0.86 ns, while the 2-way and 4-way are 1.12 and 1.37 ns, respectively. This makes the relative access times 1.12/0.86 = 1.30, or 30% more for the 2-way, and 1.37/0.86 = 1.59, or 59% more for the 4-way.
b. The access time of the 16 KB cache is 1.27 ns, while the 32 and 64 KB are 1.35 and 1.37 ns, respectively. This makes the relative access times 1.35/1.27 = 1.06, or 6% larger for the 32 KB, and 1.37/1.27 = 1.078, or 8% larger for the 64 KB.
c. Avg. access time = hit% × hit time + miss% × miss penalty; miss% = misses per instruction/references per instruction = 2.2% (DM), 1.2% (2-way), 0.33% (4-way), 0.09% (8-way).
Direct mapped access time = 0.86 ns @ 0.5 ns cycle time = 2 cycles
2-way set associative = 1.12 ns @ 0.5 ns cycle time = 3 cycles
4-way set associative = 1.37 ns @ 0.83 ns cycle time = 2 cycles
8-way set associative = 2.03 ns @ 0.79 ns cycle time = 3 cycles
Miss penalty = 10/0.5 = 20 cycles for DM and 2-way; 10/0.83 = 13 cycles for 4-way; 10/0.79 = 13 cycles for 8-way.
Direct mapped: (1 - 0.022) × 2 + 0.022 × 20 = 2.396 cycles → 2.396 × 0.5 = 1.2 ns
2-way: (1 - 0.012) × 3 + 0.012 × 20 = 3.2 cycles → 3.2 × 0.5 = 1.6 ns
4-way: (1 - 0.0033) × 2 + 0.0033 × 13 = 2.036 cycles → 2.036 × 0.83 = 1.69 ns
8-way: (1 - 0.0009) × 3 + 0.0009 × 13 = 3 cycles → 3 × 0.79 = 2.37 ns
Direct mapped cache is the best.
2.18 a. The average memory access time of the current (4-way 64 KB) cache is 1.69 ns. A 64 KB direct-mapped cache access time = 0.86 ns @ 0.5 ns cycle time = 2 cycles. The way-predicted cache has a cycle time and access time similar to the direct-mapped cache and a miss rate similar to the 4-way cache.
The AMAT of the way-predicted cache has three components: miss, hit with way prediction correct, and hit with way prediction mispredict: 0.0033 × 20 + (0.80 × 2 + (1 - 0.80) × 3) × (1 - 0.0033) = 2.26 cycles = 1.13 ns.
b. The cycle time of the 64 KB 4-way cache is 0.83 ns, while the 64 KB direct-mapped cache can be accessed in 0.5 ns. This provides 0.83/0.5 = 1.66, or 66% faster cache access.
c. With a 1-cycle way misprediction penalty, the AMAT is 1.13 ns (as per part a), but with a 15-cycle misprediction penalty, the AMAT becomes: 0.0033 × 20 + (0.80 × 2 + (1 - 0.80) × 15) × (1 - 0.0033) = 4.65 cycles, or 2.3 ns.
d. The serial access is 2.4 ns/1.59 ns = 1.51, or 51% slower.
2.19 a. The access time is 1.12 ns, while the cycle time is 0.51 ns, which could be potentially pipelined as finely as 1.12/0.51 = 2.2 pipestages.
b. The pipelined design (not including latch area and power) has an area of 1.19 mm^2 and energy per access of 0.16 nJ. The banked cache has an area of 1.36 mm^2 and energy per access of 0.13 nJ. The banked design uses slightly more area because it has more sense amps and other circuitry to support the two banks, while the pipelined design burns slightly more power because the memory arrays that are active are larger than in the banked case.
2.20 a. With critical word first, the miss service would require 120 cycles. Without critical word first, it would require 120 cycles for the first 16 B and 16 cycles for each of the next three 16 B blocks, or 120 + (3 × 16) = 168 cycles.
b. It depends on the contribution to Average Memory Access Time (AMAT) of the level-1 and level-2 cache misses and the percent reduction in miss service times provided by critical word first and early restart. If the percentage reduction in miss service times provided by critical word first and early restart is roughly the same for both level-1 and level-2 miss service, then if level-1 misses contribute more to AMAT, critical word first would likely be more important for level-1 misses.
2.21 a. 16 B, to match the level-2 data cache write path.
b. Assume merging write buffer entries are 16 B wide. Because each store can write 8 B, a merging write buffer entry would fill up in 2 cycles. The level-2 cache will take 4 cycles to write each entry. A nonmerging write buffer would take 4 cycles to write the 8 B result of each store. This means the merging write buffer would be two times faster.
c. With blocking caches, the presence of misses effectively freezes progress made by the machine, so whether there are misses or not doesn't change the required number of write buffer entries. With nonblocking caches, writes can be processed from the write buffer during misses, which may mean fewer entries are needed.
2.22 In all three cases, the time to look up the L1 cache will be the same. What differs is the time spent servicing L1 misses. In case (a), that time = 100 (L1 misses) × 16 cycles + 10 (L2 misses) × 200 cycles = 3600 cycles. In case (b), that time = 100 × 4 + 50 × 16 + 10 × 200 = 3200 cycles. In case (c), that time = 100 × 2 + 80 × 8 + 40 × 16 + 10 × 200 = 3480 cycles. The best design is case (b) with a 3-level cache. Going to a 2-level cache can result in many long L2 accesses (1600 cycles looking up L2). Going to a 4-level cache can result in many futile look-ups in each level of the hierarchy.
2.23 Program B enjoys a higher marginal utility from each additional way. Therefore, if the objective is to minimize overall MPKI, program B should be assigned as many ways as possible. By assigning 15 ways to program B and 1 way to program A, we achieve the minimum aggregate MPKI of 50 - (14 × 2) + 100 = 122.
2.24 Let's first assume an idealized perfect L1 cache. A 1000-instruction program would finish in 1000 cycles, that is, 1000 ns. The power consumption would be 1 W for the core and L1, plus 0.5 W of memory background power. The energy consumed would be 1.5 W × 1000 ns = 1.5 μJ.
Next, consider a PMD that has no L2 cache. A 1000-instruction program would finish in 1000 + 100 (MPKI) × 100 ns (latency per memory access) = 11,000 ns. The energy consumed would be 11,000 ns × 1.5 W (core, L1, background memory power) + 100 (memory accesses) × 35 nJ (energy per memory access) = 16,500 + 3500 nJ = 20 μJ.
For the PMD with a 256 KB L2, the 1000-instruction program would finish in 1000 + 100 (L1 MPKI) × 10 ns (L2 latency) + 20 (L2 MPKI) × 100 ns (memory latency) = 4000 ns. Energy = 1.7 W (core, L1, L2 background, memory background power) × 4000 ns + 100 (L2 accesses) × 0.5 nJ (energy per L2 access) + 20 (memory accesses) × 35 nJ = 6800 + 50 + 700 nJ = 7.55 μJ.
For the PMD with a 1 MB L2, the 1000-instruction program would finish in 1000 + 100 × 20 ns + 10 × 100 ns = 4000 ns.
Energy = 2.3 W × 4000 ns + 100 × 0.7 nJ + 10 × 35 nJ = 9200 + 70 + 350 nJ = 9.62 μJ. Therefore, of these designs, the lowest energy for the PMD is achieved with a 256 KB L2 cache.
2.25 (a) A small block size ensures that we are never fetching more bytes than required by the processor. This reduces L2 and memory access power. This is slightly offset by the need for more cache tags and higher tag array power. However, if this leads to more application misses and longer program completion time, it may ultimately result in higher application energy. For example, in the previous exercise, notice how the design with no L2 cache results in a long execution time and the highest energy. (b) A small cache size would lower cache power, but it increases memory power because the number of memory accesses will be higher. As seen in the previous exercise, the MPKIs, latencies, and energy per access ultimately decide if energy will increase or decrease. (c) Higher associativity will result in higher tag array power, but it should lower the miss rate and memory access power. It should also result in lower execution time, and eventually lower application energy.
2.26 (a) The LRU policy essentially uses recency of touch to determine priority. A newly fetched block is inserted at the head of the priority list. When a block is touched, the block is immediately promoted to the head of the priority list. When a block must be evicted, we select the block that is currently at the tail of the priority list. (b) Research studies have shown that some blocks are not touched during their residence in the cache. An Insertion policy that exploits this observation would insert a recently fetched block near the tail of the priority list. The Promotion policy moves a block to the head of the priority list when touched. This gives a block a longer residence in the cache only when it is touched at least twice within a short
window. It may also be reasonable to implement a Promotion policy that gradually moves a block a few places ahead in the priority list on every touch.
2.27 The standard approach to isolating the behavior of each program is cache partitioning. In this approach, each program receives a subset of the ways in the shared cache. When providing QoS, the ways allocated to each program can be dynamically varied based on program behavior and the service levels guaranteed to each program. When providing privacy, the allocation of ways has to be determined beforehand and cannot vary at runtime. Tuning the allocation based on the programs' current needs would result in information leakage.
2.28 A NUCA cache yields higher performance if more requests can be serviced by the banks closest to the processor. Note that we already implement a priority list (typically based on recency of access) within each set of the cache. This priority list can be used to map blocks to the NUCA banks. For example, frequently touched blocks are gradually promoted to low-latency banks while blocks that haven't been touched recently are demoted to higher-latency banks. Note that such block migrations within the cache will increase cache power. They may also complicate cache look-up.
2.29 a. A 2 GB DIMM with parity or ECC effectively has 9-bit bytes and would require 18 1-Gb DRAMs. To create 72 output bits, each one would have to output 72/18 = 4 bits.
b. A burst length of 4 reads out 32 B.
c. The DDR-667 DIMM bandwidth is 667 × 8 = 5336 MB/s.
The DDR-533 DIMM bandwidth is 533 × 8 = 4264 MB/s.
2.30 a. This is similar to the scenario given in the figure, but tRCD and CL are both 5. In addition, we are fetching two times the data in the figure. Thus, it requires 5 + 5 + 4 × 2 = 18 cycles of a 333 MHz clock, or 18 × (1/333 MHz) = 54.0 ns.
b. The read to an open bank requires 5 + 4 = 9 cycles of a 333 MHz clock, or 27.0 ns. In the case of a bank activate, this is 14 cycles, or 42.0 ns. Including 20 ns for miss processing on chip, this makes the two 42 + 20 = 62 ns and 27.0 + 20 = 47 ns. Including time on chip, the bank activate takes 62/47 = 1.32, or 32% longer.
2.31 The costs of the two systems are 2 × $130 + $800 = $1060 with the DDR2-667 DIMM and 2 × $100 + $800 = $1000 with the DDR2-533 DIMM. The latency to service a level-2 miss is 14 × (1/333 MHz) = 42 ns 80% of the time and 9 × (1/333 MHz) = 27 ns 20% of the time with the DDR2-667 DIMM. It is 12 × (1/266 MHz) = 45 ns (80% of the time) and 8 × (1/266 MHz) = 30 ns (20% of the time) with the DDR2-533 DIMM. The CPI added by the level-2 misses in the case of DDR2-667 is 0.00333 × 42 × 0.8 + 0.00333 × 27 × 0.2 = 0.130, giving a total of 1.5 + 0.130 = 1.63. Meanwhile, the CPI added by the level-2 misses for DDR2-533 is 0.00333 × 45 × 0.8 + 0.00333 × 30 × 0.2 = 0.140, giving a total of 1.5 + 0.140 = 1.64. Thus, the drop is only 1.64/1.63 = 1.006, or 0.6%, while the cost is $1060/$1000 = 1.06, or 6.0% greater. The cost/performance of the DDR2-667
system is 1.63 × 1060 = 1728, while the cost/performance of the DDR2-533 system is 1.64 × 1000 = 1640, so the DDR2-533 system is a better value.
2.32 The cores will be executing 8 cores × 3 GHz/2.0 CPI = 12 billion instructions per second. This will generate 12 × 0.00667 = 80 million level-2 misses per second. With the burst length of 8, this would be 80 × 32 B = 2560 MB/s. If the memory bandwidth is sometimes 2X this, it would be 5120 MB/s. From Fig. 2.14, this is just barely within the bandwidth provided by DDR2-667 DIMMs, so just one memory channel would suffice.
2.33 We will assume that applications exhibit spatial locality and that accesses to consecutive memory blocks will be issued in a short time window. If consecutive blocks are in the same bank, they will yield row buffer hits. While this reduces Activation energy, the two blocks have to be fetched sequentially. The second access, a row buffer hit, will experience lower latency than the first access. If consecutive blocks are in banks on different channels, they will both be row buffer misses, but the two accesses can be performed in parallel. Thus, interleaving consecutive blocks across different channels and banks can yield lower latencies, but can also consume more memory power.
2.34 a. The system built from 1 Gb DRAMs will have twice as many banks as the system built from 2 Gb DRAMs. Thus, the 1 Gb-based system should provide higher performance because it can have more banks simultaneously open.
b. The power required to drive the output lines is the same in both cases, but the system built with the x4 DRAMs would require activating banks on 18 DRAMs, versus only 9 DRAMs for the x8 parts. The page sizes activated on each x4 and x8 part are the same, and take roughly the same activation energy. Thus, because there are fewer DRAMs being activated in the x8 design option, it would have lower power.
2.35 a. With policy 1,
Precharge delay Trp = 5 × (1/333 MHz) = 15 ns
Activation delay Trcd = 5 × (1/333 MHz) = 15 ns
Column select delay Tcas = 4 × (1/333 MHz) = 12 ns
Access time when there is a row buffer hit:
Th = (r/100) × (Tcas + Tddr)
Access time when there is a miss:
Tm = ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr)
With policy 2,
Access time = Trcd + Tcas + Tddr
If A is the total number of accesses, the tip-off point will occur when the net access time with policy 1 is equal to the total access time with policy 2, that is,
(r/100) × (Tcas + Tddr) × A + ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr) × A = (Trcd + Tcas + Tddr) × A
⇒ r = 100 × Trp/(Trp + Trcd)
r = (100 × 15)/(15 + 15) = 50%
If r is less than 50%, then we have to proactively close a page to get the best performance; otherwise we can keep the page open.
b. The key benefit of closing a page is to hide the precharge delay Trp from the critical path. If the accesses are back to back, then this is not possible. This new constraint will not impact policy 1.
The new equations for policy 2:
Access time when we can hide the precharge delay = Trcd + Tcas + Tddr
Access time when the precharge delay is in the critical path = Trcd + Tcas + Trp + Tddr
Equation 1 will now become:
(r/100) × (Tcas + Tddr) × A + ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr) × A = 0.9 × (Trcd + Tcas + Tddr) × A + 0.1 × (Trcd + Tcas + Trp + Tddr) × A
c. For any row buffer hit rate, policy 2 requires an additional r × (2 + 4) nJ per access. If r = 50%, then policy 2 requires 3 nJ of additional energy.
2.36 Hibernating will be useful when the static energy saved in DRAM is at least equal to the energy required to copy from DRAM to Flash and then back to DRAM. DRAM dynamic energy to read/write is negligible compared to Flash and can be ignored.
Time = (8 × 10^9 × 2 × 2.56 × 10^-6)/(64 × 1.6) = 400 seconds
The factor 2 in the above equation is because to hibernate and wake up, both Flash and DRAM have to be read and written once.
2.37 a. Yes. The application and production environment can be run on a VM hosted on a development machine.
b. Yes. Applications can be redeployed on the same environment on top of VMs running on different hardware. This is commonly called business continuity.
c. No. Depending on support in the architecture, virtualizing I/O may add significant or very significant performance overheads.
d. Yes. Applications running on different virtual machines are isolated from each other.
e. Yes. See "Devirtualizable virtual machines enabling general, single-node, online maintenance," David Lowell, Yasushi Saito, and Eileen Samberg, in the Proceedings of the 11th ASPLOS, 2004, pages 211-223.
2.38 a. Programs that do a lot of computation, but have small memory working sets and do little I/O or other system calls.
b. The slowdown above previously was 60% for 10%, so 20% system time would run 120% slower.
c. The median slowdown using pure virtualization is 10.3, while for paravirtualization, the median slowdown is 3.76.
d. The null call and null I/O call have the largest slowdown. These have no real work to outweigh the virtualization overhead of changing protection levels, so they have the largest slowdowns.
2.39 The virtual machine running on top of another virtual machine would have to emulate privilege levels as if it was running on a host without VT-x technology.
2.40 a. As of the date of the computer paper, AMD-V adds more support for virtualizing virtual memory, so it could provide higher performance for memory-intensive applications with large memory footprints.
b. Both provide support for interrupt virtualization, but AMD's IOMMU also adds capabilities that allow secure virtual machine guest operating system access to selected devices.
2.41 Open hands-on exercise, no fixed solution
2.42 An aggressive prefetcher brings in useful blocks as well as several blocks that are not immediately useful. If prefetched blocks are placed in the cache (or in a prefetch buffer, for that matter), they may evict other blocks that are imminently useful, thus potentially doing more harm than good. A second significant downside is an increase in memory utilization, which may increase queuing delays for demand accesses. This is especially problematic in multicore systems where the bandwidth is nearly saturated and at a premium.
2.43 a. These results are from experiments on a 3.3 GHz Intel® Xeon® Processor X5680 with Nehalem architecture (Westmere at 32 nm). The number of misses per 1K instructions of the L1 D cache increases significantly, by more than 300X, when the input data size goes from 8 to 64 KB, and stays relatively constant around 300/1K instructions for all the larger data sets. Similar behavior, with different flattening points, is observed on the L2 and L3 caches.
b. The IPC decreases by 60%, 20%, and 66% when the input data size goes from 8 to 128 KB, from 128 KB to 4 MB, and from 4 to 32 MB, respectively. This shows the importance of all caches. Among all three levels, the L1 and L3 caches are more important. This is because the L2 cache in the Intel® Xeon® Processor X5680 is relatively small and slow, with its capacity being 256 KB and latency being around 11 cycles.