Lossless FFTs Using Posit Arithmetic
Siew Hoon Leong¹ and John L. Gustafson²
¹ Swiss National Supercomputing Centre, ETH Zurich, Zurich, Switzerland
cerlane.leong@cscs.ch
² Arizona State University, Tempe, USA
jlgusta6@asu.edu
Abstract. The Fast Fourier Transform (FFT) is required for chemistry, weather, defense, and signal processing for seismic exploration and radio astronomy. It is communication-bound, making supercomputers thousands of times slower at FFTs than at dense linear algebra. The key to accelerating FFTs is to minimize bits per datum without sacrificing accuracy. The 16-bit fixed-point and IEEE float types lack sufficient accuracy for 1024- and 4096-point FFTs of data from analog-to-digital converters. We show that the 16-bit posit, with higher accuracy and larger dynamic range, can perform FFTs so accurately that a forward-inverse FFT restores the original signal perfectly. "Reversible" FFTs with posits are lossless, eliminating the need for 32-bit or higher precision. Similarly, 32-bit posit FFTs can replace 64-bit float FFTs for many HPC tasks. Speed, energy efficiency, and storage costs can thus be improved by 2x for a broad range of HPC workloads.
Keywords: Posit · Quire · FFT · Computer Arithmetic
1 Introduction
The posit™ number format is the rounded form of Type III universal number (unum) arithmetic [13,16]. It evolved from Type II unums in December 2016 as a hardware-friendly drop-in alternative to the floating-point IEEE Std 754™ [19]. The tapered accuracy of posits allows them to have more fraction bits in the most commonly used range, thus enabling posits to be more accurate than floating-point numbers (floats) of the same size, yet have an even larger dynamic range than floats. Posit arithmetic also introduces the quire, an exact accumulator for fused dot products, that can dramatically reduce rounding errors.
The computation of the Discrete Fourier Transform (DFT) using the Fast Fourier Transform (FFT) algorithm has become one of the most important and powerful tools of High Performance Computing (HPC). FFTs are investigated here to demonstrate the speed and accuracy of posit arithmetic when compared to floating-point arithmetic. Improving FFTs can potentially improve the performance of HPC applications such as CP2K [18], SpecFEM3D [20,21], and
Supported by A*STAR and NSCC Singapore.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Gustafson et al. (Eds.): CoNGA 2023, LNCS 13851, pp. 1–18, 2023.
https://doi.org/10.1007/978-3-031-32180-1_1
WRF [31], leading to higher speed and improved accuracy. "Precision" here refers to the number of bits in a number format, and "accuracy" refers to the correctness of an answer, measured in the number of correct decimals or correct bits. Posits achieve orders-of-magnitude smaller rounding errors when compared to floats that have the same precision [26,27]. Thus, the commonly used 32-bit and 64-bit precision floats can potentially be replaced with 16-bit and 32-bit posits respectively in some applications, doubling the speed of communication-bound computations and halving the storage and power costs.
We say FFT accuracy is lossless when the inverse FFT reproduces the original signal perfectly; that is, the FFT is reversible. The rounding error of 16-bit IEEE floats and the 16-bit fixed-point format adds too much noise for those formats to perform lossless FFTs, which forces programmers to use 32-bit floats for signal processing tasks. In contrast, 16-bit posits have enough accuracy for a "round trip" to be lossless at the resolution of common Analog-to-Digital Converters (ADCs) that supply the input data. Once reversibility is achieved, the use of more bits of precision is wasteful since the transformation retains all information.
The paper is organized as follows: In Sect. 2, related work on posits, fixed-point FFTs, and floating-point FFTs is surveyed. Background information on the posit and quire formats and the FFT is provided in Sect. 3. Section 4 presents the approach used to evaluate the accuracy and performance of radix-2 and radix-4 (1024- and 4096-point) FFTs (Decimation-In-Time and Decimation-In-Frequency) using both 16-bit posits and 16-bit floats. The results of the evaluation are discussed in Sect. 5. Finally, the conclusions and plans for future work are presented in Sect. 6.
2 Related Work
Posits are a new form of computer arithmetic invented by Gustafson in December 2016. The concept was first publicly shared as a Stanford lecture seminar [16] in February 2017. The first peer-reviewed posit journal paper [15] was published in June 2017. Since then, studies on posit correctness [6], accuracy and efficiency when compared to floats [26], and various software and Field-Programmable Gate Array (FPGA) implementations [5,24,29] have been performed. Due to the flexibility to choose the precision required and express a high dynamic range using very few bits, researchers have found posits particularly well-suited to machine learning applications. The work in [23] demonstrated that very-low-precision posits outperform fixed-point and all other tested formats for inference, thus improving speed for time-critical AI tasks such as self-driving cars.
Efforts to improve DFTs by reducing the number of operations can be traced back to the work of Cooley and Tukey [8] in 1965, whose improvements based on the algorithm of Good [12] reduced the operation complexity from O(N²) to O(N log₂ N), now called FFTs [7, p. 1667]. Additional work to further improve the FFT algorithm led to radix-2^m algorithms [9,30], the Rader-Brenner algorithm [30], the Winograd algorithm (WFTA) [35,36], and prime factor algorithms (PFA) [11,33]. In practice, radix-2, radix-4, and split-radix are the most widely adopted types.
The effect of arithmetic precision on performance has been studied [1]. For optimum speed, the lowest possible bit precision should be used that still meets accuracy requirements. Fixed-point, as opposed to floating-point, is traditionally used to implement FFT algorithms in custom Digital Signal Processing (DSP) hardware [3] due to the higher cost and complexity of floating-point logic. Fixed-point reduces power dissipation and achieves higher speed [3,25]. Custom Application-Specific Integrated Circuits (ASICs) and FPGAs allow the use of unusual and variable word sizes without a speed penalty, but that flexibility is not available in a general programming environment.
3 Background
In this section, the posit format, the corresponding quire exact accumulator format, and the FFT algorithm are discussed.
3.1 Posits
Posits are much simpler than floats, which can potentially result in faster circuits requiring less chip area [2]. The Posit Standard (2022) specifies the format and its required operations. The main advantages of posits over floats are:
– Higher accuracy for the most commonly used value range
– 1-to-1 map of signed binary integers to ordered real numbers
– Bitwise reproducibility across all computing systems
– Increased dynamic range
– More information per bit (higher Shannon entropy)
– Only two exception values: zero and Not-a-Real (NaR)
– Support for associative and distributive laws
The differences among 16-bit float, fixed-point, and posit formats are displayed in Fig. 1, where each color block is a bit, i.e. 0 or 1. The colors depict the fields (sign, exponent, fraction, integer, or regime) that the bits represent.
A 16-bit float (Fig. 1a) consists of a sign bit and 5 exponent bits, leaving 10 bits for the fraction after the "hidden bit." If signal data is centered about 0 so the sign bit is significant, a 16-bit float is capable of storing signed values from ADCs with up to 12 bits of output, but no larger.
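As a quick sanity check on that claim, the sketch below (our own, using NumPy's `float16` as the IEEE half-precision type) verifies that every signed 12-bit ADC code, scaled into [-1, 1), converts to a 16-bit float and back without loss:

```python
import numpy as np

# Every 12-bit ADC code k in [-2048, 2048), scaled to [-1, 1), needs at
# most 11 significand bits (hidden bit + 10 fraction bits), so IEEE
# half precision (float16) represents it exactly.
exact = all(
    float(np.float16(k / 2048.0)) == k / 2048.0
    for k in range(-2048, 2048)
)
print(exact)  # True: a 12-bit ADC sample fits losslessly in a 16-bit float
```

A 13-bit converter would break this: odd 13-bit codes need 12 significand bits, one more than half precision carries.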
Although the number of bits to the left of the radix point can be flexibly chosen for a fixed-point format (Fig. 1b), 1024-point FFTs of data in the range -1 to 1 can produce values between -32 and 32 in general. Therefore, a 16-bit fixed-point format for the FFT requires integer bits that increase from 2 to 6 with the stages of the FFT, leaving 13 to 9 bits respectively for the fraction part. At first glance, fixed-point would appear to have the best accuracy for FFTs, since it allows the maximum possible number of fraction bits. However, towards the final stage of a 1024-point FFT computation, a 16-bit float will still have 10 fraction bits (excluding the hidden bit) while fixed-point will only have 9 fraction bits to accommodate the larger worst-case integer part it needs to store to avoid catastrophic overflow. For 4096-point FFTs, fixed-point will only have 8 fraction bits. Posits will have 10 to 12 fraction bits for the results of the FFT. Consequently, 16-bit fixed-point has the lowest accuracy among the three number formats for both 1024- and 4096-point FFTs; posits have the highest accuracy (see Fig. 2). Note also that the "twiddle factors" are trigonometric functions in the range -1 to 1, which posits represent with about 0.6 decimals greater accuracy than floats.
As with floats and integers, the most significant bit of a posit indicates the sign. The "regime" bits use signed unary encoding requiring 2 to 15 bits (Fig. 1c). Accuracy tapers, with the highest accuracy for values with magnitudes near 1 and less accuracy for the largest- and smallest-magnitude numbers. Posit arithmetic hardware requires integer adders, integer multipliers, shifters, leading-zero counters, and AND trees very similar to those required for IEEE floats; however, posit hardware is simpler in having a single rounding mode, no internal flags, and only two exception values to deal with. Comparison operations are those of integers; no extra hardware is needed. Proprietary designs show a reduction in gate count for posits versus floats, for both FPGA and VLSI designs, and a reduction in operation latency [2].
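To make the tapered-accuracy tradeoff concrete, the hypothetical helper below counts how many fraction bits a posit⟨16,1⟩ has left after the sign, regime, and one-bit exponent fields are paid for, given a value's magnitude. The field-width arithmetic follows the format description above; the function name is ours.

```python
import math

def p16_fraction_bits(x):
    """Fraction bits available in a posit<16,1> for magnitude x (sketch).

    useed = 2^(2^eS) = 4, so the regime value k is floor(e/2) for binary
    exponent e; the regime field is a run of identical bits plus one
    terminating bit, capped at the 15 bits that follow the sign bit."""
    assert x > 0
    e = math.floor(math.log2(x))
    k = e // 2                               # regime value
    run = k + 1 if k >= 0 else -k            # length of the identical-bit run
    regime_bits = min(run + 1, 15)           # run + terminator, capped
    return max(0, 16 - 1 - regime_bits - 1)  # minus sign, regime, eS fields

# Near magnitude 1 a posit<16,1> keeps 12 fraction bits (vs. 10 for a
# 16-bit float); accuracy tapers away toward the extremes of the range.
print(p16_fraction_bits(1.0), p16_fraction_bits(8.0), p16_fraction_bits(2.0**27))
```

The two extra fraction bits in the most-used range are the source of the roughly 4x rounding-error advantage reported in Sect. 5.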
The dynamic range for IEEE Standard 16-bit floats is from 2^-24 to 65504, or about 6.0 × 10^-8 to 6.5 × 10^4 (12 orders of magnitude). Floats use tapered accuracy for small-magnitude ("subnormal") values only, making their dynamic range unbalanced about 1. The reciprocal of a small-magnitude float overflows to infinity. For 16-bit posits, the use of a single eS exponent bit allows expression of magnitudes from 2^-28 to 2^28, or about 3.7 × 10^-9 to 2.7 × 10^8 (almost 17 orders of magnitude). Posit tapered precision is symmetrical about magnitude 1, and reciprocation is closed and exact for integer powers of 2. Thus, the accuracy advantage of posits does not come at the cost of reduced dynamic range.
Fig. 1. Different number formats: (a) floating-point; (b) fixed-point; (c) posit
Fig. 2. Comparison of 16-bit posits and floats for accuracy
3.2 The Quire Register
The concept of the quire [14, pp. 80–84], a fixed-point scratch value, originates from the work of Kulisch and Miranker [22]. The quire data type accumulates the additions and subtractions of products of two posits using exact fixed-point arithmetic, with no rounding errors. When the accumulated result is converted back to posit form with a single rounding, the result is a "fused dot product." Thus, a quire data type only needs to support add and subtract operations, and obey the same rules as integer add and subtract operations (augmented with the ability to handle a NaR input value).
To store the result of a fused dot product without any rounding, a quire data type must minimally support the range [minPos², maxPos²], where minPos is the smallest expressible real greater than zero and maxPos is the biggest expressible real, for a particular n-bit posit. Since there will be a need for the quire to accumulate the results of fused dot products of long vectors, an additional n - 1 bits are prepended to the most significant bits as carry overflow protection. Thus, a 16-bit posit with a 1-bit eS (posit⟨16,1⟩) will have a corresponding 128-bit quire, notated quire128⟨16,1⟩.
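The required quire width follows directly from that range. The sketch below (our own variable names) recomputes the 128-bit figure for posit⟨16,1⟩, whose maxPos is 2^28:

```python
# Quire width for posit<16,1>: cover [minPos^2, maxPos^2] exactly in
# fixed point, plus a sign bit and n-1 carry-guard bits.
n = 16
max_pos_exp = 28              # maxPos = 2^28, minPos = 2^-28
frac_bits = 2 * max_pos_exp   # to hold minPos^2 = 2^-56 exactly
int_bits = 2 * max_pos_exp    # to hold maxPos^2 = 2^56
sign_bit = 1
carry_guard = n - 1           # headroom for accumulating long vectors
quire_width = sign_bit + int_bits + frac_bits + carry_guard
print(quire_width)  # 128
```

The same arithmetic for a 32-bit posit with eS = 2 (maxPos = 2^120) gives the 512-bit quire used at higher precision.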
The use of the quire reduces cumulative rounding error, as will be demonstrated in Sects. 4 and 5, and enables correctly rounded fused dot products. Notice that the complex multiply-add in an FFT can be expressed as a pair of dot products, so all of the complex rotations in the FFT need incur only one rounding per real and imaginary part, instead of four (if all operations are rounded) or two (if fused multiply-add is used).
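A minimal model of that observation: the butterfly rotation u + w·t can be computed as two real three-term dot products, each accumulated exactly and rounded once. In this sketch (our own names) Python's exact `Fraction` arithmetic stands in for the quire:

```python
from fractions import Fraction

def twiddle_mac(ur, ui, wr, wi, tr, ti):
    """u + w*t for complex u, w, t, as two exact dot products.

    Each part is accumulated without rounding (Fraction stands in for
    the quire) and rounded once at the end, instead of once per
    multiply and per add."""
    re = Fraction(ur) + Fraction(wr) * Fraction(tr) - Fraction(wi) * Fraction(ti)
    im = Fraction(ui) + Fraction(wr) * Fraction(ti) + Fraction(wi) * Fraction(tr)
    return float(re), float(im)   # one rounding per real/imaginary part

# Rotating t = 1 by w = i and adding u = 1 gives 1 + i:
print(twiddle_mac(1, 0, 0, 1, 1, 0))  # (1.0, 1.0)
```

Real posit hardware performs the same deferral with the 128-bit quire register rather than arbitrary-precision rationals.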
3.3 The FFT Algorithm
The discrete form of the Fourier transform (DFT) can be written as

X_k = (1/√N) Σ_{n=0}^{N-1} x_n e^{-2πikn/N},   k = 0, ..., N-1,   (1)

where x_n is the real-valued sequence of N data points and X_k is the complex-valued sequence of N data points. The sum is scaled by 1/√N such that the inverse DFT has the same form other than the sign of the exponent, which requires reversing the direction of the angular rotation factors (the imaginary part of the complex value), commonly known as "twiddles" or "twiddle factors":

x_n = (1/√N) Σ_{k=0}^{N-1} X_k e^{2πikn/N},   n = 0, ..., N-1.   (2)

Following this convention, the twiddle factor e^{-2πikn/N} will be written as w for short. While it is also possible to have no scaling in one direction and a scaling of 1/N in the other, this has only the merit that it saves one operation per point in a forward-inverse transformation. Scaling by 1/√N makes both forward and inverse transforms unitary and consistent. The forms shown in Eqs. 1 and 2 have the additional advantage that they keep intermediate values from growing in magnitude unnecessarily, a property that is crucial for fixed-point arithmetic to prevent overflow, and desirable for posit arithmetic since it maximizes accuracy. The only variant from traditional FFT algorithms used here is that the data set is scaled by 0.5 on every radix-4 pass, or every other pass of a radix-2 FFT. This automatically provides the 1/√N scaling while keeping the computations in the range where posits have maximum accuracy. The scaling by 0.5 can be incorporated into the twiddle factor table to eliminate the cost of an extra multiply operation.
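The per-pass 0.5 scaling can be illustrated with a plain radix-2 implementation. The sketch below is our own pure-Python DIT FFT, not the paper's SoftPosit code: it scales by 0.5 on every other pass, so the accumulated scale after log₂ N passes is exactly 1/√N and the forward/inverse pair is unitary.

```python
import cmath

def fft_scaled(x, inverse=False):
    """Iterative radix-2 DIT FFT with 0.5 scaling on alternate passes,
    yielding the 1/sqrt(N) unitary scaling (N an even power of two)."""
    n = len(x)
    m = n.bit_length() - 1
    assert 1 << m == n and m % 2 == 0
    sign = 1.0 if inverse else -1.0
    # Bit-reversal permutation of the input indices.
    a = [x[int(format(i, '0%db' % m)[::-1], 2)] for i in range(n)]
    for p in range(m):
        half = 1 << p
        scale = 0.5 if p % 2 == 1 else 1.0   # scale by 0.5 every other pass
        for start in range(0, n, 2 * half):
            for j in range(half):
                w = cmath.exp(sign * 2j * cmath.pi * j / (2 * half))
                u, t = a[start + j], w * a[start + j + half]
                a[start + j] = (u + t) * scale
                a[start + j + half] = (u - t) * scale
    return a

# Round trip: forward followed by inverse restores the input (to roundoff).
x = [complex(i % 5 - 2, 0) for i in range(16)]
y = fft_scaled(fft_scaled(x), inverse=True)
print(max(abs(a - b) for a, b in zip(x, y)) < 1e-12)  # True
```

In a production kernel the 0.5 would be folded into the twiddle table, as the paper notes, rather than multiplied separately.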
The FFT is a form of the DFT that uses the fact that the summations can be represented as a matrix-vector product, and the matrix can be factored to reduce the computational complexity to N log₂ N operations. In this paper, two basic classes of FFT algorithms, Decimation-In-Time (DIT) and Decimation-In-Frequency (DIF), will be discussed. DIT algorithms decompose the time sequences into successively smaller subsequences, while DIF algorithms decompose the coefficients into smaller subsequences [28].
Traditional analysis of the computational complexity of the FFT centers on the number of multiplications and additions. The kernel operation of an FFT is often called a "butterfly" because of its dataflow pattern (see Fig. 3). The original radix-2 algorithm performs 10 operations (6 additions and 4 multiplications) per butterfly [34, p. 42], highlighted in red in Fig. 3, and there are (1/2) N log₂ N butterflies, so the operation complexity is 5 N log₂ N for large N. Use of radix-4 reduces this to 4.5 N log₂ N; split-radix methods are 4 N log₂ N, and with a little more whittling away this can be further reduced to about 3.88 N log₂ N [32]. Operation count, however, is not the key to increasing FFT performance, since the FFT is communication-bound and not computation-bound.
Supercomputer users are often surprised by the tiny fraction of peak performance they obtain when they perform FFTs while using highly optimized vendor libraries that are well-tuned for a particular system. The TOP500 list shows
Fig. 3. A radix-2 "butterfly" calculation
many systems achieving over 80% of their peak performance for multiply-add operations in a dense linear algebra problem. FFTs, which have only multiply and add operations and predetermined dataflow for a given size N, might be expected to achieve similar performance. However, traditional complexity analysis, which counts the number of operations, does a poor job of predicting actual FFT execution timings.
FFTs are thus the Achilles' heel of HPC, because they tax the most valuable resource: data motion. In an era when performance is almost always communication-bound and not computation-bound, it is more sensible to optimize the data motion as opposed to the operation count. Dense linear algebra involves order N³ operations but only order N² data motion, making it one of the few HPC workloads that is still compute-bound and for which operation count correlates well with peak arithmetic performance. While some have studied the communication aspects of the FFT based on a single processor with a cache hierarchy (a valid model for small-scale digital signal processing), supercomputers are increasingly limited by the laws of physics and not by architecture. The communication cost for the FFT (in the limit where data access is limited by the speed of light and its physical distance) is thus not order N log₂ N.
Figure 4 shows a typical (16-point) FFT diagram where the nodes (brown dots) and edges (blue lines) represent the data points and communications in each stage, respectively. The figure illustrates a DIT FFT, but a DIF FFT is simply its mirror image, so the following argument applies to either case. For any input on the left side, data travels in the y dimension by absolute distance 1, 2, 4, ..., N/2 positions, a total motion of N - 1 positions. Simplistic models of performance assume all edges are of equal time cost, but this is not true if the physical limits on communication speed are considered. The total motion cost of N - 1 positions holds for each of the N data points, hence the total communication work is order N², the same order complexity as a DFT without any clever factoring. This assumes memory is physically placed in a line. In a real system like a supercomputer cluster covering many square meters, memory is distributed over a plane, for which the average distance between locations is order N^(1/2), or in a volume, for which the average distance is order N^(1/3). Those configurations result in physics-limited FFT communication complexity of order N^(3/2) or N^(4/3) respectively, both of which grow faster with N than does N log₂ N. It is possible
to do all-to-all exchanges partway through the FFT to make communications local again, but this merely shifts the communication cost into the all-to-all exchange. Thus, it is not surprising that large-scale supercomputers attain only a fraction of their peak arithmetic performance when performing FFTs.
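The distance argument above is easy to check numerically; this small helper (our own) sums the per-stage vertical distances 1, 2, 4, ..., N/2 of the butterfly diagram:

```python
def total_butterfly_motion(n):
    """Total vertical distance one input travels across all log2(n)
    stages of the FFT diagram: 1 + 2 + 4 + ... + n/2 = n - 1."""
    assert n > 1 and n & (n - 1) == 0      # n must be a power of two
    return sum(1 << s for s in range(n.bit_length() - 1))

print(total_butterfly_motion(16))   # 15 positions, as in the Fig. 4 diagram
# Multiplied by n data points, the total communication work is order n^2.
```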
This observation points us to a different approach: reduce the bits moved, not the operation count. The communication cost grows linearly with the number of bits used per data item. The use of a data format, i.e. posits, that has considerably more information per bit than IEEE 754 floating-point numbers can generate answers with acceptable accuracy using fewer bits. In the following section, 16-bit posits will be used to compute FFTs with higher accuracy than 16-bit (IEEE half-precision) floats, potentially allowing 32-bit floats to be safely replaced by 16-bit posits, doubling the speed (by halving the communication cost) of signal and image processing workloads. We speculate that similar performance doubling is possible for HPC workloads that presently use 64-bit floats for FFTs, by making them sufficiently accurate using 32-bit posits.
4 Approach
4.1 Accuracy
To test the effectiveness of posits, 1024- and 4096-point FFTs are used as the sizes most commonly found in the literature. Both radix-2 and radix-4 methods are studied here, but not split-radix [9,30]. Although modified split-radix has the smallest operation count (about 3.88 N log₂ N), fixed-point studies [4] show it has poorer accuracy than radix-2 and radix-4 methods.
Both DIT and DIF methods are tested. The DIF approach introduces multiplicative rounding early in the processing. Intuition says this might pollute the later computations more than the DIT method, where the first passes perform no multiplications. Empirical tests are conducted to check this intuition with three numerical approaches:
Fig. 4. A typical FFT diagram
– 16-bit IEEE standard floats, with maximum use of fused multiply-add operations to reduce rounding error in the multiplications of complex numbers,
– 16-bit posits (eS = 1) with exactly the same operations as used for floats, and
– 16-bit posits using the quire to further reduce rounding error to one rounding per pass of the FFT.
For each of these 24 combinations of data-point size, radix, decimation type, and numerical approach, random uniform-distribution input signals in the range [-1, 1) at the resolution of a 12-bit ADC are created. The 12-bit fixed-point ADC inputs are first transformed into their corresponding 16-bit posits and floats as shown in Fig. 5. A "round-trip" (forward followed by inverse) FFT is then applied before the results are converted (with rounding) back to the 12-bit ADC fixed-point format (represented by ADC′ in Fig. 5). If no errors and roundings occur, the original signal is recovered perfectly, i.e. ADC′ = ADC.
Fig. 5. A round-trip FFT for a 12-bit ADC
The absolute error of a 12-bit ADC input can thus be computed as

AbsoluteError = |ADC′ - ADC|.   (3)

This error represents the rounding errors incurred by posits and floats respectively.
To evaluate the accuracy of posits and floats, the vector of absolute errors of all ADC inputs is evaluated. Three flavors of measures are computed over this vector: the maximum (L∞ norm), RMS (L2 norm), and average (L1 norm).
To gain additional insight into the error, the units-in-the-last-place (ULP) metric is used. As shown by Goldberg [10, p. 8], the ULP error measure is superior to relative error for measuring pure rounding error. The ULP error is computed as shown in Eq. 4. For a 12-bit fixed-point ADC, ulp(ADC) is a constant value (2^-11 for input in the [-1, 1) range).
ULPerror = |ADC′ - ADC| / ulp(ADC),   (4)

where ulp(ADC) is one unit in the last place of the ADC value.
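The error pipeline of Fig. 5 and Eqs. 3–4 can be sketched as below. This is not the paper's SoftPosit experiment: as a stand-in, a float64 unitary FFT round trip is used with the intermediate spectrum rounded through float16 (emulating storage rounding only), and the result is requantized to the 12-bit ADC grid; all names are ours.

```python
import numpy as np

N = 1024
ulp = 2.0 ** -11                    # one ADC ULP for inputs in [-1, 1)
rng = np.random.default_rng(0)
adc_in = rng.integers(-2048, 2048, size=N) * ulp   # 12-bit ADC samples

# Stand-in round trip: unitary forward/inverse FFT in float64 with the
# spectrum rounded through float16 (models storage rounding only).
spec = np.fft.fft(adc_in) / np.sqrt(N)
spec = (spec.real.astype(np.float16).astype(np.float64)
        + 1j * spec.imag.astype(np.float16).astype(np.float64))
back = np.real(np.fft.ifft(spec * np.sqrt(N)))
adc_out = np.clip(np.round(back / ulp), -2048, 2047) * ulp   # requantize

abs_err = np.abs(adc_out - adc_in)             # Eq. 3, per sample
print("L-inf:", abs_err.max())                 # maximum norm
print("RMS  :", np.sqrt(np.mean(abs_err**2)))  # L2 norm
print("L1   :", abs_err.mean())                # average norm
print("ULP errors:", abs_err / ulp)            # Eq. 4, multiples of one ULP
```

Because both endpoints of Eq. 3 lie on the ADC grid, every absolute error here is an exact integer multiple of ulp(ADC), which is what makes the "0.5 ULP cliff" discussion in Sect. 5 meaningful.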
In the case of posits and floats, even if an answer is correctly rounded, there will be an error of as much as 0.5 ULP. With each additional arithmetic operation, the errors accumulate. Consequently, to minimize the effect of rounding errors, it is important to round as seldom as possible, e.g. by using the quire to defer rounding as long as possible.
Because an ADC handles only a finite range of uniformly spaced points adjusted to the dynamic range it needs to handle, a Gaussian distribution, which is unbounded, is deemed unsuitable for this evaluation. A uniformly spaced distribution that is bounded to the required range is used instead. Preliminary tests were also conducted on bell-shaped distributions (truncated Gaussian distributions) confined to the same range, [-1, 1), representing three standard deviations from the mean at 0; they yielded results similar to those for the uniform-distribution tests presented here.
For 16-bit fixed-point, we rely on analysis because it obviates experimentation. After every pass of an FFT, the FFT values must be scaled by 1/2 to guarantee there is no overflow. For an FFT with 2^(2k) points, the result of the 2k scalings will be an answer that is too small by a factor of 2^k, so it must be scaled up by a factor of 2^k, shifting left by k bits. This introduces zeros on the right that reveal the loss of accuracy of the fixed-point approach. The loss of accuracy is 5 bits for a 1024-point FFT, and 6 bits for a 4096-point FFT. In FPGA development, it is possible to use non-power-of-two data sizes easily, and fixed-point can be made to yield acceptable 1024-point FFT results if the data points have 18 bits of precision [25]. Since fixed-point requires much less hardware than floats (or posits), this is an excellent approach for special-purpose FPGA designs. In the more general computing environment where data sizes are a power-of-two bits in size, a programmer using the fixed-point format has little choice but to upsize all the values to 32-bit size. The same will be shown true for 16-bit floats, which cannot achieve acceptable accuracy.
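That analysis reduces to one line of arithmetic; the helper below (our own naming) returns the bits of accuracy a 16-bit fixed-point FFT gives up for a given transform size:

```python
import math

def fixed_point_bits_lost(n_points):
    """Accuracy loss, in bits, of a 16-bit fixed-point FFT that scales
    by 1/2 on every pass: with N = 2^(2k) points, the final scale-up by
    2^k shifts k zero bits into the result."""
    m = int(math.log2(n_points))
    assert 1 << m == n_points and m % 2 == 0
    return m // 2

print(fixed_point_bits_lost(1024), fixed_point_bits_lost(4096))  # 5 6
```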
4.2 Performance
The performance of large-scale FFTs is communication-bound, as pointed out in Sect. 3.3. The reduction in the size of operands not only reduces time proportionately, but also reduces cache spill effects. For example, a 1024-by-1024 2D FFT computed with 16-bit posits will fit in a 4 MB cache. If the FFT were performed using 32-bit floats to achieve sufficient accuracy, the data would not fit in cache and the cache "spill" would reduce performance dramatically, by a factor of more than two. This is a well-known effect.
However, there is a need to show that posit arithmetic can be as fast as float arithmetic, possibly faster. Otherwise, the projected bandwidth savings might be offset by slower arithmetic. Until VLSI processors using posits as a native type are complete, arithmetic performance comparisons between posits and floats of the same bit size can be performed with similar implementations in software.
A software library, SoftPosit, is used in this study. It is closely based on Berkeley SoftFloat [17] (Release 3d). Similar implementation and optimization techniques are adopted to enable a fair comparison of the performance of 16-bit posits versus 16-bit floats. Note: the performance results on posits are preliminary since SoftPosit is a new library; 26 years of optimization effort have been put into Berkeley SoftFloat.
Table 1. Test machine specification
The specification of the machine used to evaluate the performance is shown in Table 1. Both SoftPosit and SoftFloat are compiled with GNU GCC 4.8.5 with optimization level "O2" and architecture set to "core-avx2".
The arithmetic operations of posit⟨16,1⟩ and quire128⟨16,1⟩, and of IEEE Standard half-precision floats (float⟨16,5⟩), are shown in Table 2. Each operation is implemented using integer operators in C. With the exception of the fused dot product (a posit arithmetic functionality that is not in the IEEE 754 Standard), there is an equivalent posit operation for every float operation shown in Table 2.
The most significant rounding errors that influence the accuracy of DFT algorithms occur in each butterfly calculation, the bfly routine. To reduce the number of rounding errors, fused multiply-adds are leveraged. Posit arithmetic can perform fused multiply-adds or, better, leverage fused dot products with the quire data type to further reduce the accumulation of rounding errors.
The twiddle factors are obtained using a precomputed cosine table with 1024 points to store the values of cos(0) to cos(π/2). The sine and cosine values for the entire unit circle are found through indexed reflections and negations of these discrete values.
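A quarter-wave table of that kind can be sketched as follows. The index arithmetic is ours, and the lookup is exact whenever the FFT size divides 4 × 1024, as it does for the 1024- and 4096-point cases studied here:

```python
import math

Q = 1024                                  # table covers cos(0)..cos(pi/2)
cos_tab = [math.cos(math.pi * i / (2 * Q)) for i in range(Q + 1)]

def cos_twiddle(i, n):
    """cos(2*pi*i/n) via reflections/negations of the quarter-wave table."""
    j = (i * 4 * Q // n) % (4 * Q)        # map onto one full period
    if j <= Q:
        return cos_tab[j]                 # first quadrant: direct
    if j <= 2 * Q:
        return -cos_tab[2 * Q - j]        # second quadrant: reflect, negate
    if j <= 3 * Q:
        return -cos_tab[j - 2 * Q]        # third quadrant: negate
    return cos_tab[4 * Q - j]             # fourth quadrant: reflect

def sin_twiddle(i, n):
    """sin(2*pi*i/n) = cos(2*pi*(i - n/4)/n), for n divisible by 4."""
    return cos_twiddle(i - n // 4, n)
```

A usage example: `cos_twiddle(100, 1024)` returns cos(2π·100/1024) from the table without any trigonometric evaluation at run time.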
A 1D FFT with input sizes 1024 and 4096 is used to check the computational performance with posit⟨16,1⟩ without quire, posit⟨16,1⟩ with quire, and float⟨16,5⟩, pinning each run to the same selected core.
5 Results
5.1 Accuracy
Figure 6 shows the average RMS errors for 1024-point tests, representing hundreds of thousands of experimental data points. The RMS error bar graph for the 4096-point FFTs looks very similar, but the bars are uniformly about 12% higher. The vertical axis represents Units in the Last Place (ULP) at the resolution of
Table 2. Arithmetic operations

Arithmetic operation   | posit⟨16,1⟩ function | float⟨16,5⟩ function
Add                    | p16_add              | f16_add
Subtract               | p16_sub              | f16_sub
Multiply               | p16_mul              | f16_mul
Divide                 | p16_div              | f16_div
Fused multiply-add     | p16_mulAdd           | f16_mulAdd
Fused dot product-add  | q16_fdp_add          | Not applicable
Fused dot product-sub  | q16_fdp_sub          | Not applicable
a 12-bit ADC, as described in the previous section. Besides the RMS error, L1 and L∞ errors are calculated and show a nearly identical pattern, differing only in the vertical scale.
Fig. 6. RMS errors per value × 10⁶ for 1024-point FFTs
The obvious difference is that 16-bit posits have about 1/4 the rounding error of floats when running algorithms that round at identical places in the dataflow. This follows from the fact that most of the operations occur where posits have two bits greater accuracy in the fraction part of their format. The use of the quire further reduces errors, by as much as 1.8x for the case of the radix-4 form of the FFT. (Similarly, rounding error is about 16 times less for 32-bit posits than for 32-bit floats, since 32-bit posits have 28 bits in the significand for values with magnitudes between 1/16 and 16, compared with 24 bits in the significand of a standard float.)
The other differences, between DIT and DIF, or between radix-2 and radix-4, are subtle but statistically significant and repeatable. For posits without the quire, radix-4 is slightly less accurate than radix-2 because intermediate calculations in the longer dot product can stray into the regions where posits have only one bit instead of two bits of greater fraction precision. However, the quire provides a strong advantage for radix-4, reducing the number of rounding events per result to only 4 per point for a 1024-point FFT, and 4 more for the inverse.
However, Fig. 6 understates the significance of the higher accuracy of posits. The original ADC signal is the "gold standard" for the correct answer. Both float and posit computations make rounding errors in performing a round-trip FFT. Additionally, we round that result again to express it in the fixed-point format used by the ADC, as shown in Fig. 5. Once the result is accurate to within 0.5 ULP of the ADC input format, it will round to the original value with no error. But if the result is more than 0.5 ULP from the original ADC value, it "falls off a cliff" and rounds to the wrong fixed-point number. Because of this insight, we focus on the number of bits wrong compared to the original signal (measured in ADC ULPs) and not measures such as RMS error or dB signal-to-noise ratios, which are more appropriate when numerical errors are frequent and pervasive.
Suppose we measure and sum the absolute ULPs of error for every data point in a round-trip (radix-2, DIF) 1024-point FFT. Figure 7a shows that the massive losses produced by 16-bit floats preclude their use for FFTs.
For posits, on the other hand, 97.9% of the values make the round trip with all bits identical to the original value. The 2.1% that are off are off by only 1 ULP. While the reversibility is not mathematically perfect, it is nearly so, and may be accurate enough to eliminate the need to use a 32-bit data representation. The bar chart shows 16-bit posits to be about 36 times as accurate as 16-bit floats in preserving the information content of the original data. The ratio is similar for a 4096-point FFT, shown in Fig. 7b; the two plots are almost indistinguishable except for the vertical scale.
Figure 8a shows another way to visualize the error, plotting the error in ULPs as a function of the data point (real-part errors in blue, imaginary-part errors in orange). The errors are as large as six ULPs from the original data, and 68% of the round-trip values are incorrect. An error of six ULPs represents a worst-case loss of 1.8 decimals of accuracy (a one-bit loss represents about a 0.3-decimal loss), and 16-bit floats only have about 3.6 decimals to begin with. Figure 8b shows the results when using posits and the quire, with just a few scattered points that do not lie on the x-axis. The information loss is very slight.
Fig. 7. Total ULPs of error for (a) a round-trip 1024-point FFT and (b) a round-trip 4096-point FFT
Fig. 8. ULP errors for a 1024-point round-trip FFT: (a) floats; (b) posits with quire
Fig. 9. Percent round-trip errors versus ADC bit resolution for 1024-point FFTs: (a) posits; (b) floats
Can information loss be reduced to zero? It can, as shown in Fig. 9. If we were to use a lower-resolution ADC, with 11 or 10 bits of resolution, the 16-bit posit approach can result in perfect reversibility; no bits are lost. Low-precision ADCs are in widespread use, all the way down to fast 2-bit converters that produce only the values -1, 0, or 1. The oil and gas industry is notorious for moving seismic data by the truckload, literally, and they store their FFTs of low-resolution ADC output in 32-bit floats to protect against any loss of data that was very expensive to acquire. A similar situation holds for radio astronomy, X-ray crystallography, magnetic resonance imaging (MRI) data, and so on. The use of 16-bit posits could cut all the storage and data motion costs in half for these applications. Figure 9 shows that the insufficient accuracy of 16-bit floats forces the use of 32-bit floats to achieve lossless reversibility.
5.2 Performance
The performance of the arithmetic operations add, subtract, multiply, and divide from SoftPosit and SoftFloat is measured exhaustively by simulating all possible combinations of the inputs in the range [-1, 1], where most of the FFT calculations occur. Fused multiply-add and fused dot product would require from days to weeks to be exhaustively tested on the selected machine. Consequently, those tests are simplified by reusing function arguments, i.e. restricting the test to run only two nested loops exhaustively instead of three. Ten runs for each operation were performed on a selected core to eliminate performance variations between cores and remove the variation caused by operating system interrupts and other interference.
In the selected input range, float⟨16,5⟩ and posit⟨16,1⟩ have 471,951,364 and 1,073,807,361 two-argument input combinations respectively. Thus posit⟨16,1⟩ has more accuracy than float⟨16,5⟩. One of the reasons why floats⟨16,5⟩ have fewer than half the bit patterns compared to posits⟨16,1⟩ is the 2048 bit patterns reserved for "non-numbers," i.e. when all exponent bits are 1s. "Non-numbers" represent positive and negative infinity and not-a-number (NaN). In comparison, posits do not waste bit patterns and have only one Not-a-Real (NaR) bit pattern, 100...000, and only one representation for zero, 000...000. Additionally, posits have more values close to ±1 and to 0 than do floats.
The performance in operations per second for arguments in the range [-1, 1] is shown in Fig. 10.
Fig. 10. posit⟨16,1⟩ versus float⟨16,5⟩ performance in the range [-1, 1]
The results show that posits have slightly better performance in multiply and divide operations, while floats have slightly better performance in add, subtract, and FMA operations. "FDP-add" and "FDP-sub" are additions and subtractions of products using the quire when computing a fused dot product. They show higher performance than FMA because one of the arguments, the quire, does not require additional shifting before adding/subtracting the dot product of the two other posit arguments. It also does not need to perform rounding until it completes all accumulations, which saves time. When appropriately used, quires can potentially improve performance while minimizing rounding errors.
6 Conclusions and Future Work
We have shown that 16-bit posits outperform 16-bit floats and fixed-point in accuracy for radix-2 and radix-4, 1024- and 4096-point FFTs, for both DIT and DIF classes. To have accuracy similar to that of 16-bit posits, 32-bit floats would have to be used. When ADC inputs are 11 bits or smaller, 16-bit posits can compute completely lossless "round-trip" FFTs. 16-bit posits have computation performance comparable to that of 16-bit floats, but approximately twice the performance on bandwidth-bound tasks such as the FFT. Because posit arithmetic is still in its infancy, the performance results shown here, obtained using an in-house software emulator, SoftPosit, are preliminary.
While we have here studied examples from the low-precision side of HPC, the advantages of posits should also apply to high-precision FFT applications such as ab initio computational chemistry, radar cross-section analysis, and the solution of partial differential equations (PDEs) by spectral methods. For some users, posits might be desirable to increase accuracy using the same precision, instead of to enable the use of half as many bits per variable. A 32-bit posit is nominally 16 times as accurate per operation as a 32-bit float in terms of its fraction bits, though tests on real applications show the advantage to be more like 50 times as accurate [26], probably because of the accumulation of error in time-stepping physical simulations.
Another area for future FFT investigation is the Winograd FFT algorithm [35,36], which trades multiplications for additions. Normally, summing numbers of similar magnitude and opposite sign magnifies any relative error in the addends, whereas multiplications are always accurate to 0.5 ULP, so Winograd's approach might seem dubious for floats or posits. However, the range of values where FFTs take place is rich in additions that make no rounding errors, so this deserves investigation.
It is crucial to note that the advantage of posits is not limited to FFTs, and we expect to expand experimental comparisons of float and posit accuracy to other algorithms such as linear equation solution and matrix multiplication, for both HPC and machine learning (ML) purposes. One promising area for 32-bit posits is weather prediction (which frequently relies on FFTs that are typically performed with 64-bit floats).
Large-scale simulations typically achieve only single-digit percentages of the peak speed of an HPC system, which means the arithmetic units are spending most of their time waiting for operands to be communicated to them. Hardware bandwidths have been increasing very slowly for the last several decades, and there has been no steep trend line for bandwidth like there has been for transistor size and cost (Moore's law). High-accuracy posit arithmetic permits the use of reduced data sizes, which promises to provide dramatic speedups not just for FFT kernels but for the very broad range of bandwidth-bound applications that presently rely on floating-point arithmetic.