Chapter 1 Solutions
Case Study 1: Chip Fabrication Cost
1.1 a. Yield = 1/(1 + (0.04 × 2))^14 = 0.34
b. It is fabricated in a larger technology, which is an older plant. As plants age, their process gets tuned, and the defect rate decreases.
1.2 a. Phoenix: Dies per wafer = 724
Yield = 1/(1 + (0.04 × 2))^14 = 0.340
Profit = 724 × 0.34 × $30 = $7384.80
b. Red Dragon: Dies per wafer = 1234
c. Phoenix chips: 25,000/724 = 34.5 wafers needed
Red Dragon chips: 50,000/1234 = 40.5 wafers needed
Therefore, the most lucrative split is 40 Red Dragon wafers, 30 Phoenix wafers.
1.3 a. Defect-free single core: Yield = 1/(1 + (0.04 × 0.25))^14 = 0.87
Equation for the probability that N cores are defect free on a chip: (# combinations) × (0.87)^N × (1 - 0.87)^(8 - N)
Yield for Phoenix4: (0.39 + 0.21 + 0.06 + 0.01) = 0.67
Yield for Phoenix2: (0.001 + 0.0001) = 0.0011
Yield for Phoenix1: 0.000004
b. It would be worthwhile to sell Phoenix4. However, the other two have such a low probability of occurring that it is not worth selling them.
Computer Architecture: A Quantitative Approach (6th Edition) Solutions Manual (John L. Hennessy, David A. Patterson)
c. $20 = Wafer size/(old dpw × 0.28)
Step 1: Determine how many Phoenix4 chips are produced for every Phoenix8 chip.
There are 67/33 Phoenix4 chips for every Phoenix8 chip = 2.03
$30 + 2.03 × $25 = $80.76
Case Study 2: Power Consumption in Computer Systems
1.4 a. Energy: 1/8. Power: unchanged.
b. Energy: Energy_new/Energy_old = (Voltage × 1/8)^2/Voltage^2 = 0.0156
Power: Power_new/Power_old = 0.0156 × (Frequency × 1/8)/Frequency = 0.00195
c. Energy: Energy_new/Energy_old = (Voltage × 0.5)^2/Voltage^2 = 0.25
Power: Power_new/Power_old = 0.25 × (Frequency × 1/8)/Frequency = 0.0313
d. 1 core = 25% of the original power, running for 25% of the time:
0.25 × 0.25 + (0.25 × 0.2) × 0.75 = 0.0625 + 0.0375 = 0.1
1.5 a. Amdahl's law: 1/(0.8/4 + 0.2) = 1/(0.2 + 0.2) = 1/0.4 = 2.5
b. 4 cores, each at 1/2.5 the frequency and voltage
Energy: Energy_quad/Energy_single = 4 × (Voltage × 1/2.5)^2/Voltage^2 = 0.64
Power: Power_new/Power_old = 0.64 × (Frequency × 1/2.5)/Frequency = 0.256
c. 2 cores + 2 ASICs vs. 4 cores
1.6 a. Workload A speedup: 225,000/13,461 = 16.7
Workload B speedup: 280,000/36,465 = 7.7
Overall speedup: 1/(0.7/16.7 + 0.3/7.7) ≈ 12.4
b. General-purpose: 0.70 × 0.42 + 0.30 = 0.594
GPU: 0.70 × 0.37 + 0.30 = 0.559
TPU: 0.70 × 0.80 + 0.30 = 0.86
c. General-purpose: 159 W + (455 W - 159 W) × 0.594 = 335 W
GPU: 357 W + (991 W - 357 W) × 0.559 = 711 W
TPU: 290 W + (384 W - 290 W) × 0.86 = 371 W
d.
GPU: 1/(0.4/2.46 + 0.1/2.76 + 0.5/1.25) = 1.67
TPU: 1/(0.4/41 + 0.1/21.2 + 0.5/0.17) = 0.33
e. General-purpose: 14,000/504 = 27.8 → 28
GPU: 14,000/1838 = 7.62 → 8
TPU: 14,000/861 = 16.3 → 17
f. General-purpose: 2200/504 = 4.37 → 4; 14,000/(4 × 504) = 6.94 → 7
GPU: 2200/1838 = 1.2 → 1; 14,000/(1 × 1838) = 7.62 → 8
TPU: 2200/861 = 2.56 → 2; 14,000/(2 × 861) = 8.13 → 9
Exercises
1.7 a. Somewhere between 1.4^10 and 1.55^10, or 28.9× to 80×
b. 6043 in 2003; a 52% growth rate per year for 12 years gives 60,500,000 (rounded)
c. 24,129 in 2010; a 22% growth rate per year for 15 years gives 1,920,000 (rounded)
d. Multiple cores on a chip rather than faster single-core performance
e. 2 = x^22, x = 1.032, giving 3.2% growth
1.8 a. 50%
b. Energy: Energy_new/Energy_old = (Voltage × 1/2)^2/Voltage^2 = 0.25
1.9 a. 60%
b. 0.4 + 0.6 × 0.3 = 0.58, which reduces the energy to 58% of the original energy
c. new Power/old Power = [Capacitance × (Voltage × 0.8)^2 × (Frequency × 0.4)]/[Capacitance × Voltage^2 × Frequency] = 0.8^2 × 0.4 = 0.256 of the original power.
d. 0.4 + 0.3 × 0.2 = 0.46, which reduces the energy to 46% of the original energy
1.10 a. 10^9/100 = 10^7
b. 10^7/(10^7 + 24) ≈ 1
c. [need solution]
1.11 a. 35/10,000 × 3333 = 11.67 days
b. There are several correct answers. One would be that, with the current system, one computer fails approximately every 5 min. 5 min is unlikely to be enough time to isolate the computer, swap it out, and get the computer back online again. 10 min, however, is much more likely. In any case, it would greatly extend the amount of time before 1/3 of the computers have failed at once. Because the cost of downtime is so huge, being able to extend this is very valuable.
c. $90,000 = (x + x + x + 2x)/4
$360,000 = 5x
$72,000 = x; 4th quarter = 2x = $144,000/h
Figure S.1 Plot of the equation: y = 100/((100 - x) + x/10).
1.12 a. See Figure S.1.
b. 2 = 1/((1 - x) + x/20); x = 10/19 = 52.6%
c. (0.526/20)/(0.474 + 0.526/20) = 5.3%
d. Extra speedup with 2 units: 1/(0.1 + 0.9/2) = 1.82. 1.82 × 20 = 36.4. Total speedup: 1.95. Extra speedup with 4 units: 1/(0.1 + 0.9/4) = 3.08. 3.08 × 20 = 61.5. Total speedup: 1.97.
1.13 a. old execution time = 0.5 new + 0.5 × 10 new = 5.5 new
b. In the original code, the unenhanced part is equal in time to the enhanced part (sped up by 10), therefore:
(1 - x) = x/10
10 - 10x = x
10 = 11x
10/11 = x = 0.91
1.14 a. 1/(0.8 + 0.20/2) = 1.11
b. 1/(0.7 + 0.20/2 + 0.10 × 3/2) = 1.05
c. fp ops: 0.1/0.95 = 10.5%, cache: 0.15/0.95 = 15.8%
1.15 a. 1/(0.5 + 0.5/22) = 1.91
b. 1/(0.1 + 0.90/22) = 7.10
c. 41% × 22 = 9, so A runs on 9 cores. Speedup of A on 9 cores: 1/(0.5 + 0.5/9) = 1.8. Overall speedup if 9 cores have a 1.8 speedup and the others none: 1/(0.6 + 0.4/1.8) = 1.22
d. Calculate values for all processors as in c. Obtain: 1.8, 3, 1.82, 2.5, respectively.
e. 1/(0.41/1.8 + 0.27/3 + 0.18/1.82 + 0.14/2.5) = 2.12
1.16 a. 1/(0.2 + 0.8/N)
b. 1/(0.2 + 8 × 0.005 + 0.8/8) = 2.94
c. 1/(0.2 + 3 × 0.005 + 0.8/8) = 3.17
d. 1/(0.2 + log2(N) × 0.005 + 0.8/N)
e. d/dN (1/((1 - P) + log2(N) × 0.005 + P/N)) = 0
Chapter 2 Solutions
Case Study 1: Optimizing Cache Performance via Advanced Techniques
2.1 a. Each element is 8 B. Because a 64 B cache line has 8 elements, and each column access will result in fetching a new line for the nonideal matrix, we need a minimum of 8 × 8 (64 elements) for each matrix. Hence, the minimum cache size is 128 × 8 B = 1 KB.
b. The blocked version only has to fetch each input and output element once. The unblocked version will have one cache miss for every 64 B/8 B = 8 row elements. Each column requires 64 B × 256 of storage, or 16 KB. Thus, column elements will be replaced in the cache before they can be used again. Hence, the unblocked version will have 9 misses (1 row and 8 columns) for every 2 in the blocked version.
c. for (i = 0; i < 256; i = i + B) {
       for (j = 0; j < 256; j = j + B) {
           for (m = 0; m < B; m++) {
               for (n = 0; n < B; n++) {
                   output[j + n][i + m] = input[i + m][j + n];
               }
           }
       }
   }
d. 2-way set associative. In a direct-mapped cache, the blocks could be allocated so that they map to overlapping regions in the cache.
e. You should be able to determine the level-1 cache size by varying the block size. The ratio of the blocked and unblocked program speeds for arrays that do not fit in the cache in comparison to blocks that do is a function of the cache block size, whether the machine has out-of-order issue, and the bandwidth provided by the level-2 cache. You may have discrepancies if your machine has a write-through level-1 cache and the write buffer becomes a limiter of performance.
2.2 Because the unblocked version is too large to fit in the cache, processing eight 8 B elements requires fetching one 64 B row cache block and 8 column cache blocks. Because each iteration requires 2 cycles without misses, prefetches can be initiated every 2 cycles, and the number of prefetches per iteration is more than one, the memory system will be completely saturated with prefetches. Because the latency of a prefetch is 16 cycles, and one will start every 2 cycles, 16/2 = 8 will be outstanding at a time.
2.3 Open hands-on exercise, no fixed solution
Case Study 2: Putting It All Together: Highly Parallel Memory Systems
2.4 a. The second-level cache is 1 MB and has a 128 B block size.
b. The miss penalty of the second-level cache is approximately 105 ns.
c. The second-level cache is 8-way set associative.
d. The main memory is 512 MB.
e. Walking through pages with a 16 B stride takes 946 ns per reference. With 250 such references per page, this works out to approximately 240 μs per page.
2.5 a. Hint: This is visible in the graph above shown as a slight increase in L2 miss service time for large data sets, and is 4 KB for the graph above.
b. Hint: Take independent strides by the page size and look for increases in latency not attributable to cache sizes. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.
c. Hint: This is visible in the graph above shown as a slight increase in L2 miss service time for large data sets, and is 15 ns in the graph above.
d. Hint: Take independent strides that are multiples of the page size to see if the TLB is fully associative or set associative. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.
2.6 a. Hint: Look at the speed of programs that easily fit in the top-level cache as a function of the number of threads.
b. Hint: Compare the performance of independent references as a function of their placement in memory.
2.7 Open hands-on exercise, no fixed solution
Case Study 3: Studying the Impact of Various Memory System Organizations
2.8 On a row buffer miss, the time taken to retrieve a 64-byte block of data equals tRP + tRCD + CL + transfer time = 13 + 13 + 13 + 4 = 43 ns.
2.9 On a row buffer hit, the time taken to retrieve a 64-byte block of data equals CL + transfer time = 13 + 4 = 17 ns.
2.10 Each row buffer miss involves steps (i)-(iii) in the problem description. Before the Precharge of a new read operation can begin, the Precharge, Activate, and CAS latencies of the previous read operation must elapse. In other words, back-to-back read operations are separated by time tRP + tRCD + CL = 39 ns. Of this time, the memory channel is only occupied for 4 ns. Therefore, the channel utilization is 4/39 = 10.3%.
2.11 When a read operation is initiated in a bank, 39 ns later, the channel is busy for 4 ns. To keep the channel busy all the time, the memory controller should initiate read operations in a bank every 4 ns. Because successive read operations to a single bank must be separated by 39 ns, over that period, the memory controller should initiate read operations to unique banks. Therefore, at least 10 banks
are required if the memory controller hopes to achieve 100% channel utilization. It can then initiate a new read operation to each of these 10 banks every 4 ns, and then repeat the process. Another way to arrive at this answer is to divide 100 by the 10.3% utilization calculated previously.
2.12 We can assume that accesses to a bank alternate between row buffer hits and misses. The sequence of operations is as follows: Precharge begins at time 0; Activate is performed at time 13 ns; CAS is performed at time 26 ns; a second CAS (the row buffer hit) is performed at time 30 ns; a Precharge is performed at time 43 ns; and so on. In the above sequence, we can initiate a row buffer miss and a row buffer hit to a bank every 43 ns. So, the channel is busy for 8 out of every 43 ns, that is, a utilization of 18.6%. We would therefore need at least 6 banks to achieve 100% channel utilization.
2.13 With a single bank, the first request would be serviced after 43 ns. The second would be serviced 39 ns later, that is, at time 82 ns. The third and fourth would be serviced at times 121 and 160 ns. This gives us an average memory latency of (43 + 82 + 121 + 160)/4 = 101.5 ns. Note that waiting in the queue is a significant component of this average latency. If we had four banks, the first request would be serviced after 43 ns. Assuming that the four requests go to four different banks, the second, third, and fourth requests are serviced at times 47, 51, and 55 ns. This gives us an average memory latency of 49 ns. The average latency would be higher if two of the requests went to the same bank. Note that by offering even more banks, we would increase the probability that the four requests go to different banks.
2.14 As seen in the previous questions, by growing the number of banks, we can support higher channel utilization, thus offering higher bandwidth to an application. We have also seen that by having more banks, we can support higher parallelism and therefore lower queuing delays and lower average memory latency. Thus, even though the latency for a single access has not changed because of limitations from physics, by boosting parallelism with banks, we have been able to improve both average memory latency and bandwidth. This argues for a memory chip that is partitioned into as many banks as we can manage. However, each bank introduces overheads in terms of peripheral logic near the bank. To balance performance and density (cost), DRAM manufacturers have settled on a modest number of banks per chip: 8 for DDR3 and 16 for DDR4.
2.15 When channel utilization is halved, the power reduces from 535 to 295 mW, a 45% reduction. At 70% channel utilization, increasing the row buffer hit rate from 50% to 80% moves power from 535 to 373 mW, a 30% reduction.
2.16 The table is as follows (assuming the default system config):
Note that the 4 Gb DRAM chips are manufactured with a better technology, so it is better to use fewer and larger chips to construct a rank with a given capacity, if we are trying to reduce power.
Exercises
2.17 a. The access time of the direct-mapped cache is 0.86 ns, while the 2-way and 4-way are 1.12 and 1.37 ns, respectively. This makes the relative access times 1.12/0.86 = 1.30, or 30% more for the 2-way, and 1.37/0.86 = 1.59, or 59% more for the 4-way.
b. The access time of the 16 KB cache is 1.27 ns, while the 32 and 64 KB are 1.35 and 1.37 ns, respectively. This makes the relative access times 1.35/1.27 = 1.06, or 6% larger for the 32 KB, and 1.37/1.27 = 1.078, or 8% larger for the 64 KB.
c. Avg. access time = hit% × hit time + miss% × miss penalty; miss% = misses per instruction/references per instruction = 2.2% (DM), 1.2% (2-way), 0.33% (4-way), 0.09% (8-way).
Direct mapped access time = 0.86 ns @ 0.5 ns cycle time = 2 cycles
2-way set associative = 1.12 ns @ 0.5 ns cycle time = 3 cycles
4-way set associative = 1.37 ns @ 0.83 ns cycle time = 2 cycles
8-way set associative = 2.03 ns @ 0.79 ns cycle time = 3 cycles
Miss penalty = 10/0.5 = 20 cycles for DM and 2-way; 10/0.83 = 13 cycles for 4-way; 10/0.79 = 13 cycles for 8-way.
Direct mapped: (1 - 0.022) × 2 + 0.022 × 20 = 2.396 cycles → 2.396 × 0.5 = 1.2 ns
2-way: (1 - 0.012) × 3 + 0.012 × 20 = 3.2 cycles → 3.2 × 0.5 = 1.6 ns
4-way: (1 - 0.0033) × 2 + 0.0033 × 13 = 2.036 cycles → 2.036 × 0.83 = 1.69 ns
8-way: (1 - 0.0009) × 3 + 0.0009 × 13 = 3 cycles → 3 × 0.79 = 2.37 ns
Direct mapped cache is the best.
2.18 a. The average memory access time of the current (4-way 64 KB) cache is 1.69 ns. A 64 KB direct-mapped cache access time = 0.86 ns @ 0.5 ns cycle time = 2 cycles. The way-predicted cache has a cycle time and access time similar to the direct-mapped cache and a miss rate similar to the 4-way cache.
The AMAT of the way-predicted cache has three components: miss, hit with way prediction correct, and hit with way prediction mispredict: 0.0033 × 20 + (0.80 × 2 + (1 - 0.80) × 3) × (1 - 0.0033) = 2.26 cycles = 1.13 ns.
b. The cycle time of the 64 KB 4-way cache is 0.83 ns, while the 64 KB direct-mapped cache can be accessed in 0.5 ns. This provides 0.83/0.5 = 1.66, or 66% faster cache access.
c. With a 1-cycle way misprediction penalty, the AMAT is 1.13 ns (as per part a), but with a 15-cycle misprediction penalty, the AMAT becomes: 0.0033 × 20 + (0.80 × 2 + (1 - 0.80) × 15) × (1 - 0.0033) = 4.65 cycles, or 2.3 ns.
d. The serial access is 2.4 ns/1.59 ns = 1.51, or 51% slower.
2.19 a. The access time is 1.12 ns, while the cycle time is 0.51 ns, which could be potentially pipelined as finely as 1.12/0.51 = 2.2 pipestages.
b. The pipelined design (not including latch area and power) has an area of 1.19 mm^2 and energy per access of 0.16 nJ. The banked cache has an area of 1.36 mm^2 and energy per access of 0.13 nJ. The banked design uses slightly more area because it has more sense amps and other circuitry to support the two banks, while the pipelined design burns slightly more power because the memory arrays that are active are larger than in the banked case.
2.20 a. With critical word first, the miss service would require 120 cycles. Without critical word first, it would require 120 cycles for the first 16 B and 16 cycles for each of the next three 16 B blocks, or 120 + (3 × 16) = 168 cycles.
b. It depends on the contribution to Average Memory Access Time (AMAT) of the level-1 and level-2 cache misses and the percent reduction in miss service times provided by critical word first and early restart. If the percentage reduction in miss service times provided by critical word first and early restart is roughly the same for both level-1 and level-2 miss service, then if level-1 misses contribute more to AMAT, critical word first would likely be more important for level-1 misses.
2.21 a. 16 B, to match the level-2 data cache write path.
b. Assume merging write buffer entries are 16 B wide. Because each store can write 8 B, a merging write buffer entry would fill up in 2 cycles. The level-2 cache will take 4 cycles to write each entry. A nonmerging write buffer would take 4 cycles to write the 8 B result of each store. This means the merging write buffer would be two times faster.
c. With blocking caches, the presence of misses effectively freezes progress made by the machine, so whether there are misses or not doesn't change the required number of write buffer entries. With nonblocking caches, writes can be processed from the write buffer during misses, which may mean fewer entries are needed.
2.22 In all three cases, the time to look up the L1 cache will be the same. What differs is the time spent servicing L1 misses. In case (a), that time = 100 (L1 misses) × 16 cycles + 10 (L2 misses) × 200 cycles = 3600 cycles. In case (b), that time = 100 × 4 + 50 × 16 + 10 × 200 = 3200 cycles. In case (c), that time = 100 × 2 + 80 × 8 + 40 × 16 + 10 × 200 = 3480 cycles. The best design is case (b) with a 3-level cache. Going to a 2-level cache can result in many long L2 accesses (1600 cycles looking up L2). Going to a 4-level cache can result in many futile look-ups in each level of the hierarchy.
2.23 Program B enjoys a higher marginal utility from each additional way. Therefore, if the objective is to minimize overall MPKI, program B should be assigned as many ways as possible. By assigning 15 ways to program B and 1 way to program A, we achieve the minimum aggregate MPKI of 50 - (14 × 2) + 100 = 122.
2.24 Let's first assume an idealized perfect L1 cache. A 1000-instruction program would finish in 1000 cycles, that is, 1000 ns. The power consumption would be 1 W for the core and L1, plus 0.5 W of memory background power. The energy consumed would be 1.5 W × 1000 ns = 1.5 μJ.
Next, consider a PMD that has no L2 cache. A 1000-instruction program would finish in 1000 + 100 (MPKI) × 100 ns (latency per memory access) = 11,000 ns. The energy consumed would be 11,000 ns × 1.5 W (core, L1, background memory power) + 100 (memory accesses) × 35 nJ (energy per memory access) = 16,500 + 3500 nJ = 20 μJ.
For the PMD with a 256 KB L2, the 1000-instruction program would finish in 1000 + 100 (L1 MPKI) × 10 ns (L2 latency) + 20 (L2 MPKI) × 100 ns (memory latency) = 4000 ns. Energy = 1.7 W (core, L1, L2 background, memory background power) × 4000 ns + 100 (L2 accesses) × 0.5 nJ (energy per L2 access) + 20 (memory accesses) × 35 nJ = 6800 + 50 + 700 nJ = 7.55 μJ.
For the PMD with a 1 MB L2, the 1000-instruction program would finish in 1000 + 100 × 20 ns + 10 × 100 ns = 4000 ns.
Energy = 2.3 W × 4000 ns + 100 × 0.7 nJ + 10 × 35 nJ = 9200 + 70 + 350 nJ = 9.62 μJ. Therefore, of these designs, the lowest energy for the PMD is achieved with a 256 KB L2 cache.
2.25 (a) A small block size ensures that we are never fetching more bytes than required by the processor. This reduces L2 and memory access power. This is slightly offset by the need for more cache tags and higher tag array power. However, if this leads to more application misses and longer program completion time, it may ultimately result in higher application energy. For example, in the previous exercise, notice how the design with no L2 cache results in a long execution time and the highest energy. (b) A small cache size would lower cache power, but it increases memory power because the number of memory accesses will be higher. As seen in the previous exercise, the MPKIs, latencies, and energy per access ultimately decide if energy will increase or decrease. (c) Higher associativity will result in higher tag array power, but it should lower the miss rate and memory access power. It should also result in lower execution time, and eventually lower application energy.
2.26 (a) The LRU policy essentially uses recency of touch to determine priority. A newly fetched block is inserted at the head of the priority list. When a block is touched, the block is immediately promoted to the head of the priority list. When a block must be evicted, we select the block that is currently at the tail of the priority list. (b) Research studies have shown that some blocks are not touched during their residence in the cache. An Insertion policy that exploits this observation would insert a recently fetched block near the tail of the priority list. The Promotion policy moves a block to the head of the priority list when touched. This gives a block a longer residence in the cache only when it is touched at least twice within a short
window. It may also be reasonable to implement a Promotion policy that gradually moves a block a few places ahead in the priority list on every touch.
2.27 The standard approach to isolating the behavior of each program is cache partitioning. In this approach, each program receives a subset of the ways in the shared cache. When providing QoS, the ways allocated to each program can be dynamically varied based on program behavior and the service levels guaranteed to each program. When providing privacy, the allocation of ways has to be determined beforehand and cannot vary at runtime. Tuning the allocation based on the programs' current needs would result in information leakage.
2.28 A NUCA cache yields higher performance if more requests can be serviced by the banks closest to the processor. Note that we already implement a priority list (typically based on recency of access) within each set of the cache. This priority list can be used to map blocks to the NUCA banks. For example, frequently touched blocks are gradually promoted to low-latency banks while blocks that haven't been touched recently are demoted to higher-latency banks. Note that such block migrations within the cache will increase cache power. They may also complicate cache look-up.
2.29 a. A 2 GB DIMM with parity or ECC effectively has 9-bit bytes and would require 18 1-Gb DRAMs. To create 72 output bits, each one would have to output 72/18 = 4 bits.
b. A burst length of 4 reads out 32 B.
c. The DDR-667 DIMM bandwidth is 667 × 8 = 5336 MB/s.
The DDR-533 DIMM bandwidth is 533 × 8 = 4264 MB/s.
2.30 a. This is similar to the scenario given in the figure, but tRCD and CL are both 5. In addition, we are fetching two times the data in the figure. Thus, it requires 5 + 5 + 4 × 2 = 18 cycles of a 333 MHz clock, or 18 × (1/333 MHz) = 54.0 ns.
b. The read to an open bank requires 5 + 4 = 9 cycles of a 333 MHz clock, or 27.0 ns. In the case of a bank activate, this is 14 cycles, or 42.0 ns. Including 20 ns for miss processing on chip, this makes the two 42 + 20 = 62 ns and 27.0 + 20 = 47 ns. Including time on chip, the bank activate takes 62/47 = 1.32, or 32% longer.
2.31 The costs of the two systems are 2 × $130 + $800 = $1060 with the DDR2-667 DIMM and 2 × $100 + $800 = $1000 with the DDR2-533 DIMM. The latency to service a level-2 miss is 14 × (1/333 MHz) = 42 ns 80% of the time and 9 × (1/333 MHz) = 27 ns 20% of the time with the DDR2-667 DIMM. It is 12 × (1/266 MHz) = 45 ns (80% of the time) and 8 × (1/266 MHz) = 30 ns (20% of the time) with the DDR2-533 DIMM. The CPI added by the level-2 misses in the case of DDR2-667 is 0.00333 × 42 × 0.8 + 0.00333 × 27 × 0.2 = 0.130, giving a total of 1.5 + 0.130 = 1.63. Meanwhile, the CPI added by the level-2 misses for DDR2-533 is 0.00333 × 45 × 0.8 + 0.00333 × 30 × 0.2 = 0.140, giving a total of 1.5 + 0.140 = 1.64. Thus, the drop is only 1.64/1.63 = 1.006, or 0.6%, while the cost is $1060/$1000 = 1.06, or 6.0% greater. The cost/performance of the DDR2-667
system is 1.63 × 1060 = 1728, while the cost/performance of the DDR2-533 system is 1.64 × 1000 = 1640, so the DDR2-533 system is a better value.
2.32 The cores will be executing 8 cores × 3 GHz/2.0 CPI = 12 billion instructions per second. This will generate 12 × 0.00667 = 80 million level-2 misses per second. With the burst length of 8, this would be 80 × 32 B = 2560 MB/s. If the memory bandwidth is sometimes 2X this, it would be 5120 MB/s. From Fig. 2.14, this is just barely within the bandwidth provided by DDR2-667 DIMMs, so just one memory channel would suffice.
2.33 We will assume that applications exhibit spatial locality and that accesses to consecutive memory blocks will be issued in a short time window. If consecutive blocks are in the same bank, they will yield row buffer hits. While this reduces Activation energy, the two blocks have to be fetched sequentially. The second access, a row buffer hit, will experience lower latency than the first access. If consecutive blocks are in banks on different channels, they will both be row buffer misses, but the two accesses can be performed in parallel. Thus, interleaving consecutive blocks across different channels and banks can yield lower latencies, but can also consume more memory power.
2.34 a. The system built from 1 Gb DRAMs will have twice as many banks as the system built from 2 Gb DRAMs. Thus, the 1 Gb-based system should provide higher performance because it can have more banks simultaneously open.
b. The power required to drive the output lines is the same in both cases, but the system built with the x4 DRAMs would require activating banks on 18 DRAMs, versus only 9 DRAMs for the x8 parts. The page sizes activated on each x4 and x8 part are the same, and take roughly the same activation energy. Thus, because there are fewer DRAMs being activated in the x8 design option, it would have lower power.
2.35 a. With policy 1,
Precharge delay Trp = 5 × (1/333 MHz) = 15 ns
Activation delay Trcd = 5 × (1/333 MHz) = 15 ns
Column select delay Tcas = 4 × (1/333 MHz) = 12 ns
Access time when there is a row buffer hit:
Th = (r/100) × (Tcas + Tddr)
Access time when there is a miss:
Tm = ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr)
With policy 2,
Access time = Trcd + Tcas + Tddr
If A is the total number of accesses, the tip-off point will occur when the net access time with policy 1 is equal to the total access time with policy 2, that is,
(r/100) × (Tcas + Tddr) × A + ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr) × A = (Trcd + Tcas + Tddr) × A
⇒ r = 100 × Trp/(Trp + Trcd)
r = (100 × 15)/(15 + 15) = 50%
If r is less than 50%, then we have to proactively close a page to get the best performance; otherwise we can keep the page open.
b. The key benefit of closing a page is to hide the precharge delay Trp from the critical path. If the accesses are back to back, then this is not possible. This new constraint will not impact policy 1.
The new equations for policy 2:
Access time when we can hide the precharge delay = Trcd + Tcas + Tddr
Access time when the precharge delay is in the critical path = Trcd + Tcas + Trp + Tddr
Equation 1 will now become:
(r/100) × (Tcas + Tddr) × A + ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr) × A = 0.9 × (Trcd + Tcas + Tddr) × A + 0.1 × (Trcd + Tcas + Trp + Tddr) × A
c. For any row buffer hit rate, policy 2 requires an additional r × (2 + 4) nJ per access. If r = 50%, then policy 2 requires 3 nJ of additional energy.
2.36 Hibernating will be useful when the static energy saved in DRAM is at least equal to the energy required to copy from DRAM to Flash and then back to DRAM. DRAM dynamic energy to read/write is negligible compared to Flash and can be ignored.
Time = (8 × 10^9 × 2 × 2.56 × 10^-6)/(64 × 1.6) = 400 seconds
The factor 2 in the above equation is because to hibernate and wake up, both Flash and DRAM have to be read and written once.
2.37 a. Yes. The application and production environment can be run on a VM hosted on a development machine.
b. Yes. Applications can be redeployed on the same environment on top of VMs running on different hardware. This is commonly called business continuity.
c. No. Depending on support in the architecture, virtualizing I/O may add significant or very significant performance overheads.
d. Yes. Applications running on different virtual machines are isolated from each other.
e. Yes. See "Devirtualizable virtual machines enabling general, single-node, online maintenance," David Lowell, Yasushi Saito, and Eileen Samberg, in the Proceedings of the 11th ASPLOS, 2004, pages 211-223.
2.38 a. Programs that do a lot of computation, but have small memory working sets and do little I/O or other system calls.
b. The slowdown above previously was 60% for 10%, so 20% system time would run 120% slower.
c. The median slowdown using pure virtualization is 10.3, while for paravirtualization, the median slowdown is 3.76.
d. The null call and null I/O call have the largest slowdown. These have no real work to outweigh the virtualization overhead of changing protection levels, so they have the largest slowdowns.
2.39 The virtual machine running on top of another virtual machine would have to emulate privilege levels as if it was running on a host without VT-x technology.
2.40 a. As of the date of the computer paper, AMD-V adds more support for virtualizing virtual memory, so it could provide higher performance for memory-intensive applications with large memory footprints.
b. Both provide support for interrupt virtualization, but AMD's IOMMU also adds capabilities that allow secure virtual machine guest operating system access to selected devices.
2.41 Open hands-on exercise, no fixed solution
2.42 An aggressive prefetcher brings in useful blocks as well as several blocks that are not immediately useful. If prefetched blocks are placed in the cache (or in a prefetch buffer, for that matter), they may evict other blocks that are imminently useful, thus potentially doing more harm than good. A second significant downside is an increase in memory utilization, which may increase queuing delays for demand accesses. This is especially problematic in multicore systems where the bandwidth is nearly saturated and at a premium.
2.43 a. These results are from experiments on a 3.3 GHz Intel® Xeon® Processor X5680 with Nehalem architecture (Westmere at 32 nm). The number of misses per 1K instructions of the L1 D cache increases significantly, by more than 300X, when the input data size goes from 8 to 64 KB, and stays relatively constant around 300/1K instructions for all the larger data sets. Similar behavior, with different flattening points, is observed on the L2 and L3 caches.
b. The IPC decreases by 60%, 20%, and 66% when the input data size goes from 8 to 128 KB, from 128 KB to 4 MB, and from 4 to 32 MB, respectively. This shows the importance of all caches. Among all three levels, the L1 and L3 caches are more important. This is because the L2 cache in the Intel® Xeon® Processor X5680 is relatively small and slow, with its capacity being 256 KB and latency being around 11 cycles.