Trace-based Just-in-Time Type Specialization for Dynamic Languages
Andreas Gal∗+, Brendan Eich∗, Mike Shaver∗, David Anderson∗, David Mandelin∗, Mohammad R. Haghighat$, Blake Kaplan∗, Graydon Hoare∗, Boris Zbarsky∗, Jason Orendorff∗, Jesse Ruderman∗, Edwin Smith#, Rick Reitmaier#, Michael Bebenita+, Mason Chang+#, Michael Franz+
Mozilla Corporation∗ {gal,brendan,shaver,danderson,dmandelin,mrbkap,graydon,bz,jorendorff,jruderman}@mozilla.com
Adobe Corporation# {edwsmith,rreitmai}@adobe.com
Intel Corporation$ {mohammad.r.haghighat}@intel.com
University of California, Irvine+ {mbebenit,changm,franz}@uci.edu
Abstract
Dynamic languages such as JavaScript are more difficult to compile than statically typed ones. Since no concrete type information is available, traditional compilers need to emit generic code that can handle all possible type combinations at runtime. We present an alternative compilation technique for dynamically-typed languages that identifies frequently executed loop traces at run-time and then generates machine code on the fly that is specialized for the actual dynamic types occurring on each path through the loop. Our method provides cheap inter-procedural type specialization, and an elegant and efficient way of incrementally compiling lazily discovered alternative paths through nested loops. We have implemented a dynamic compiler for JavaScript based on our technique and we have measured speedups of 10x and more for certain benchmark programs.
Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors - Incremental compilers, code generation.
General Terms Design, Experimentation, Measurement, Performance.
Keywords JavaScript, just-in-time compilation, trace trees.
1. Introduction
Dynamic languages such as JavaScript, Python, and Ruby are popular since they are expressive, accessible to non-experts, and make deployment as easy as distributing a source file. They are used for small scripts as well as for complex applications. JavaScript, for example, is the de facto standard for client-side web programming and is used for the application logic of browser-based productivity applications such as Google Mail, Google Docs and Zimbra Collaboration Suite. In this domain, in order to provide a fluid user experience and enable a new generation of applications, virtual machines must provide a low startup time and high performance.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'09, June 15–20, 2009, Dublin, Ireland. Copyright © 2009 ACM 978-1-60558-392-1/09/06...$5.00
Compilers for statically typed languages rely on type information to generate efficient machine code. In a dynamically typed programming language such as JavaScript, the types of expressions may vary at runtime. This means that the compiler can no longer easily transform operations into machine instructions that operate on one specific type. Without exact type information, the compiler must emit slower generalized machine code that can deal with all potential type combinations. While compile-time static type inference might be able to gather type information to generate optimized machine code, traditional static analysis is very expensive and hence not well suited for the highly interactive environment of a web browser.
We present a trace-based compilation technique for dynamic languages that reconciles speed of compilation with excellent performance of the generated machine code. Our system uses a mixed-mode execution approach: the system starts running JavaScript in a fast-starting bytecode interpreter. As the program runs, the system identifies hot (frequently executed) bytecode sequences, records them, and compiles them to fast native code. We call such a sequence of instructions a trace.
Unlike method-based dynamic compilers, our dynamic compiler operates at the granularity of individual loops. This design choice is based on the expectation that programs spend most of their time in hot loops. Even in dynamically typed languages, we expect hot loops to be mostly type-stable, meaning that the types of values are invariant (12). For example, we would expect loop counters that start as integers to remain integers for all iterations. When both of these expectations hold, a trace-based compiler can cover the program execution with a small number of type-specialized, efficiently compiled traces.
Each compiled trace covers one path through the program with one mapping of values to types. When the VM executes a compiled trace, it cannot guarantee that the same path will be followed or that the same types will occur in subsequent loop iterations. Hence, recording and compiling a trace speculates that the path and typing will be exactly as they were during recording for subsequent iterations of the loop.
Every compiled trace contains all the guards (checks) required to validate the speculation. If one of the guards fails (if control flow is different, or a value of a different type is generated), the trace exits. If an exit becomes hot, the VM can record a branch trace starting at the exit to cover the new path. In this way, the VM records a trace tree covering all the hot paths through the loop.
Nested loops can be difficult to optimize for tracing VMs. In a naïve implementation, inner loops would become hot first, and the VM would start tracing there. When the inner loop exits, the VM would detect that a different branch was taken. The VM would try to record a branch trace, and find that the trace reaches not the inner loop header, but the outer loop header. At this point, the VM could continue tracing until it reaches the inner loop header again, thus tracing the outer loop inside a trace tree for the inner loop. But this requires tracing a copy of the outer loop for every side exit and type combination in the inner loop. In essence, this is a form of unintended tail duplication, which can easily overflow the code cache. Alternatively, the VM could simply stop tracing, and give up on ever tracing outer loops.
We solve the nested loop problem by recording nested trace trees. Our system traces the inner loop exactly as the naïve version. The system stops extending the inner tree when it reaches an outer loop, but then it starts a new trace at the outer loop header. When the outer loop reaches the inner loop header, the system tries to call the trace tree for the inner loop. If the call succeeds, the VM records the call to the inner tree as part of the outer trace and finishes the outer trace as normal. In this way, our system can trace any number of loops nested to any depth without causing excessive tail duplication.
These techniques allow a VM to dynamically translate a program to nested, type-specialized trace trees. Because traces can cross function call boundaries, our techniques also achieve the effects of inlining. Because traces have no internal control-flow joins, they can be optimized in linear time by a simple compiler (10). Thus, our tracing VM efficiently performs the same kind of optimizations that would require interprocedural analysis in a static optimization setting. This makes tracing an attractive and effective tool to type specialize even complex function call-rich code.
We implemented these techniques for an existing JavaScript interpreter, SpiderMonkey. We call the resulting tracing VM TraceMonkey. TraceMonkey supports all the JavaScript features of SpiderMonkey, with a 2x-20x speedup for traceable programs.
This paper makes the following contributions:
• We explain an algorithm for dynamically forming trace trees to cover a program, representing nested loops as nested trace trees.
• We explain how to speculatively generate efficient type-specialized code for traces from dynamic language programs.
• We validate our tracing techniques in an implementation based on the SpiderMonkey JavaScript interpreter, achieving 2x-20x speedups on many programs.
The remainder of this paper is organized as follows. Section 3 is a general overview of the trace tree based compilation we use to capture and compile frequently executed code regions. In Section 4 we describe our approach of covering nested loops using a number of individual trace trees. In Section 5 we describe the trace-compilation based speculative type specialization approach we use to generate efficient machine code from recorded bytecode traces. Our implementation of a dynamic type-specializing compiler for JavaScript is described in Section 6. Related work is discussed in Section 8. In Section 7 we evaluate our dynamic compiler based on
    1 for (var i = 2; i < 100; ++i) {
    2   if (!primes[i])
    3     continue;
    4   for (var k = i + i; k < 100; k += i)
    5     primes[k] = false;
    6 }
Figure 1. Sample program: sieve of Eratosthenes. primes is initialized to an array of 100 true values on entry to this code snippet.
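For readers who want to run the sample, the following is a minimal self-contained sketch of the same sieve. The function name `sieve` and its `limit` parameter are illustrative and not part of the paper. It assumes the array starts out all true, which the traced run described in Section 2 requires (the inner loop is entered for i = 2), and that the inner loop is bounded by k.

```javascript
// Runnable sketch of the Figure 1 sieve; `sieve` and `limit` are illustrative names.
function sieve(limit) {
  // Assumption: every entry starts out true ("possibly prime").
  const primes = new Array(limit).fill(true);
  for (let i = 2; i < limit; ++i) {
    if (!primes[i]) continue;              // i is composite: skip the inner loop
    for (let k = i + i; k < limit; k += i) {
      primes[k] = false;                   // every proper multiple of i is composite
    }
  }
  return primes;
}

const primes = sieve(100);
const found = [];
for (let i = 2; i < 100; ++i) if (primes[i]) found.push(i);
console.log(found.join(" "));
```

The outer loop visits the header on line 1 of Figure 1 once per value of i, while the inner loop only runs for prime i, which is why the two loops become hot at different times in the run described next.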










Figure 2. State machine describing the major activities of TraceMonkey and the conditions that cause transitions to a new activity. In the dark box, TM executes JS as compiled traces. In the light gray boxes, TM executes JS in the standard interpreter. White boxes are overhead. Thus, to maximize performance, we need to maximize time spent in the darkest box and minimize time spent in the white boxes. The best case is a loop where the types at the loop edge are the same as the types on entry; then TM can stay in native code until the loop is done.
a set of industry benchmarks. The paper ends with conclusions in Section 9, and an outlook on future work is presented in Section 10.
2. Overview: Example Tracing Run
This section provides an overview of our system by describing how TraceMonkey executes an example program. The example program, shown in Figure 1, computes the prime numbers below 100 with nested loops. The narrative should be read along with Figure 2, which describes the activities TraceMonkey performs and when it transitions between them.
TraceMonkey always begins executing a program in the bytecode interpreter. Every loop back edge is a potential trace point. When the interpreter crosses a loop edge, TraceMonkey invokes the trace monitor, which may decide to record or execute a native trace. At the start of execution, there are no compiled traces yet, so the trace monitor counts the number of times each loop back edge is executed until a loop becomes hot, currently after 2 crossings. Note that the way our loops are compiled, the loop edge is crossed before entering the loop, so the second crossing occurs immediately after the first iteration.
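The counting behavior of the trace monitor can be sketched as follows. This is a minimal illustration, not TraceMonkey's code: the class `TraceMonitor`, its method names, and the string results are hypothetical; only the threshold of 2 crossings comes from the text.

```javascript
// Hypothetical sketch of hot-loop detection in a trace monitor.
const HOT_THRESHOLD = 2; // from the text: a loop becomes hot after 2 crossings

class TraceMonitor {
  constructor() {
    this.crossings = new Map(); // loop header PC -> number of loop-edge crossings
    this.compiled = new Set();  // loop header PCs that already have a compiled trace
  }

  // Called by the interpreter each time it crosses a loop edge.
  // Returns the activity the VM should perform next.
  onLoopEdge(headerPc) {
    if (this.compiled.has(headerPc)) return "execute";
    const n = (this.crossings.get(headerPc) || 0) + 1;
    this.crossings.set(headerPc, n);
    return n >= HOT_THRESHOLD ? "record" : "interpret";
  }

  // Called once a trace for this header has been recorded and compiled.
  onCompiled(headerPc) {
    this.compiled.add(headerPc);
  }
}

const monitor = new TraceMonitor();
console.log(monitor.onLoopEdge(4), monitor.onLoopEdge(4));
```

Keeping the counters in the monitor rather than in the interpreter loop keeps the interpreter's fast path small; this is a design choice of the sketch, not a claim about the real implementation.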
Here is the sequence of events broken down by outer loop iteration:
    v0 := ld state[748]                 // load primes from the trace activation record
    st sp[0], v0                        // store primes to interpreter stack
    v1 := ld state[764]                 // load k from the trace activation record
    v2 := i2f(v1)                       // convert k from int to double
    st sp[8], v1                        // store k to interpreter stack
    st sp[16], 0                        // store false to interpreter stack
    v3 := ld v0[4]                      // load class word for primes
    v4 := and v3, -4                    // mask out object class tag for primes
    v5 := eq v4, Array                  // test whether primes is an array
    xf v5                               // side exit if v5 is false
    v6 := js_Array_set(v0, v2, false)   // call function to set array element
    v7 := eq v6, 0                      // test return value from call
    xt v7                               // side exit if js_Array_set returns false.
Figure 3. LIR snippet for sample program. This is the LIR recorded for line 5 of the sample program in Figure 1. The LIR encodes the semantics in SSA form using temporary variables. The LIR also encodes all the stores that the interpreter would do to its data stack. Sometimes these stores can be optimized away as the stack locations are live only on exits to the interpreter. Finally, the LIR records guards and side exits to verify the assumptions made in this recording: that primes is an array and that the call to set its element succeeds.
    mov edx, ebx(748)     // load primes from the trace activation record
    mov edi(0), edx       // (*) store primes to interpreter stack
    mov esi, ebx(764)     // load k from the trace activation record
    mov edi(8), esi       // (*) store k to interpreter stack
    mov edi(16), 0        // (*) store false to interpreter stack
    mov eax, edx(4)       // (*) load object class word for primes
    and eax, -4           // (*) mask out object class tag for primes
    cmp eax, Array        // (*) test whether primes is an array
    jne side_exit_1       // (*) side exit if primes is not an array
    sub esp, 8            // bump stack for call alignment convention
    push false            // push last argument for call
    push esi              // push first argument for call
    call js_Array_set     // call function to set array element
    add esp, 8            // clean up extra stack space
    mov ecx, ebx          // (*) created by register allocator
    test eax, eax         // (*) test return value of js_Array_set
    je side_exit_2        // (*) side exit if call failed
    side_exit_1:
    mov ecx, ebp(-4)      // restore ecx
    mov esp, ebp          // restore esp
    jmp epilog            // jump to ret statement
Figure 4. x86 snippet for sample program. This is the x86 code compiled from the LIR snippet in Figure 3. Most LIR instructions compile to a single x86 instruction. Instructions marked with (*) would be omitted by an idealized compiler that knew that none of the side exits would ever be taken. The 17 instructions generated by the compiler compare favorably with the 100+ instructions that the interpreter would execute for the same code snippet, including 4 indirect jumps.
i=2. This is the first iteration of the outer loop. The loop on lines 4-5 becomes hot on its second iteration, so TraceMonkey enters recording mode on line 4. In recording mode, TraceMonkey records the code along the trace in a low-level compiler intermediate representation we call LIR. The LIR trace encodes all the operations performed and the types of all operands. The LIR trace also encodes guards, which are checks that verify that the control flow and types are identical to those observed during trace recording. Thus, on later executions, if and only if all guards are passed, the trace has the required program semantics.
TraceMonkey stops recording when execution returns to the loop header or exits the loop. In this case, execution returns to the loop header on line 4.
After recording is finished, TraceMonkey compiles the trace to native code using the recorded type information for optimization. The result is a native code fragment that can be entered if the interpreter PC and the types of values match those observed when trace recording was started. The first trace in our example, T45, covers lines 4 and 5. This trace can be entered if the PC is at line 4, i and k are integers, and primes is an object. After compiling T45, TraceMonkey returns to the interpreter and loops back to line 1.
i=3. Now the loop header at line 1 has become hot, so TraceMonkey starts recording. When recording reaches line 4, TraceMonkey observes that it has reached an inner loop header that already has a compiled trace, so TraceMonkey attempts to nest the inner loop inside the current trace. The first step is to call the inner trace as a subroutine. This executes the loop on line 4 to completion and then returns to the recorder. TraceMonkey verifies that the call was successful and then records the call to the inner trace as part of the current trace. Recording continues until execution reaches line 1, at which point TraceMonkey finishes and compiles a trace for the outer loop, T16.
i=4. On this iteration, TraceMonkey calls T16. Because i=4, the if statement on line 2 is taken. This branch was not taken in the original trace, so this causes T16 to fail a guard and take a side exit. The exit is not yet hot, so TraceMonkey returns to the interpreter, which executes the continue statement.
i=5. TraceMonkey calls T16, which in turn calls the nested trace T45. T16 loops back to its own header, starting the next iteration without ever returning to the monitor.
i=6. On this iteration, the side exit on line 2 is taken again. This time, the side exit becomes hot, so a trace T23,1 is recorded that covers line 3 and returns to the loop header. Thus, the end of T23,1 jumps directly to the start of T16. The side exit is patched so that on future iterations, it jumps directly to T23,1.
At this point, TraceMonkey has compiled enough traces to cover the entire nested loop structure, so the rest of the program runs entirely as native code.
3. Trace Trees
In this section, we describe traces, trace trees, and how they are formed at run time. Although our techniques apply to any dynamic language interpreter, we will describe them assuming a bytecode interpreter to keep the exposition simple.
3.1 Traces
A trace is simply a program path, which may cross function call boundaries. TraceMonkey focuses on loop traces, which originate at a loop edge and represent a single iteration through the associated loop.
Similar to an extended basic block, a trace is only entered at the top, but may have many exits. In contrast to an extended basic block, a trace can contain join nodes. Since a trace always follows only one single path through the original program, however, join nodes are not recognizable as such in a trace and have a single predecessor node like regular nodes.
A typed trace is a trace annotated with a type for every variable (including temporaries) on the trace. A typed trace also has an entry type map giving the required types for variables used on the trace before they are defined. For example, a trace could have a type map (x: int, b: boolean), meaning that the trace may be entered only if the value of the variable x is of type int and the value of b is of type boolean. The entry type map is much like the signature of a function.
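The entry check implied by a type map can be illustrated with a small sketch. The helper names `typeOf` and `canEnter` are hypothetical, and the classification of numbers into "int" and "double" is a simplification chosen for this example rather than TraceMonkey's actual tagging scheme.

```javascript
// Classify a value the way this sketch's type maps do (an assumption,
// not TraceMonkey's internal type tags).
function typeOf(value) {
  if (Number.isInteger(value)) return "int";
  if (typeof value === "number") return "double";
  return typeof value; // "boolean", "string", "object", "undefined", ...
}

// A trace may be entered only if every variable named in its entry type map
// currently holds a value of the required type.
function canEnter(entryTypeMap, vars) {
  return Object.entries(entryTypeMap).every(
    ([name, type]) => typeOf(vars[name]) === type
  );
}

// The type map from the text: (x: int, b: boolean).
const entryTypeMap = { x: "int", b: "boolean" };
console.log(canEnter(entryTypeMap, { x: 3, b: true }));
console.log(canEnter(entryTypeMap, { x: 3.5, b: true }));
```

As with a function signature, the check happens once at entry; inside the trace the types are then taken as given.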
In this paper, we only discuss typed loop traces, and we will refer to them simply as "traces". The key property of typed loop traces is that they can be compiled to efficient machine code using the same techniques used for typed languages.
In TraceMonkey, traces are recorded in trace-flavored SSA LIR (low-level intermediate representation). In trace-flavored SSA (or TSSA), phi nodes appear only at the entry point, which is reached both on entry and via loop edges. The important LIR primitives are constant values, memory loads and stores (by address and offset), integer operators, floating-point operators, function calls, and conditional exits. Type conversions, such as integer to double, are represented by function calls. This makes the LIR used by TraceMonkey independent of the concrete type system and type conversion rules of the source language. The LIR operations are generic enough that the backend compiler is language independent. Figure 3 shows an example LIR trace.
Bytecode interpreters typically represent values in various complex data structures (e.g., hash tables) in a boxed format (i.e., with attached type tag bits). Since a trace is intended to represent efficient code that eliminates all that complexity, our traces operate on unboxed values in simple variables and arrays as much as possible.
A trace records all its intermediate values in a small activation record area. To make variable accesses fast on trace, the trace also imports local and global variables by unboxing them and copying them to its activation record. Thus, the trace can read and write these variables with simple loads and stores from a native activation record, independently of the boxing mechanism used by the interpreter. When the trace exits, the VM boxes the values from this native storage location and copies them back to the interpreter structures.
For every control-flow branch in the source program, the recorder generates conditional exit LIR instructions. These instructions exit from the trace if the required control flow is different from what it was at trace recording, ensuring that the trace instructions are run only if they are supposed to. We call these instructions guard instructions.
Most of our traces represent loops and end with the special loop LIR instruction. This is just an unconditional branch to the top of the trace. Such traces return only via guards.
Now, we describe the key optimizations that are performed as part of recording LIR. All of these optimizations reduce complex dynamic language constructs to simple typed constructs by specializing for the current trace. Each optimization requires guard instructions to verify its assumptions about the state and exit the trace if necessary.
Type specialization. All LIR primitives apply to operands of specific types. Thus, LIR traces are necessarily type-specialized, and a compiler can easily produce a translation that requires no type dispatches. A typical bytecode interpreter carries tag bits along with each value, and to perform any operation, must check the tag bits, dynamically dispatch, mask out the tag bits to recover the untagged value, perform the operation, and then reapply tags. LIR omits everything except the operation itself.
A potential problem is that some operations can produce values of unpredictable types. For example, reading a property from an object could yield a value of any type, not necessarily the type observed during recording. The recorder emits guard instructions that conditionally exit if the operation yields a value of a different type from that seen during recording. These guard instructions guarantee that as long as execution is on trace, the types of values match those of the typed trace. When the VM observes a side exit along such a type guard, a new typed trace is recorded originating at the side exit location, capturing the new type of the operation in question.
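A type guard of this kind can be mimicked in a few lines. The sketch below is illustrative only: `SideExit` and `recordPropertyRead` are hypothetical names, `typeof` stands in for the VM's type tags, and a thrown exception stands in for leaving native code through a side exit.

```javascript
// Sketch only: a thrown SideExit models a trace exit.
class SideExit extends Error {
  constructor(reason, pc) {
    super(reason);
    this.reason = reason; // why the trace was left
    this.pc = pc;         // interpreter PC at which to resume
  }
}

// "Record" a property read: remember the type seen now and return the
// specialized on-trace operation, which guards on that type.
function recordPropertyRead(obj, name, pc) {
  const observedType = typeof obj[name];
  return function onTrace(target) {
    const value = target[name];
    if (typeof value !== observedType) {
      throw new SideExit(`expected ${observedType}, got ${typeof value}`, pc);
    }
    return value; // past the guard, the type is known to be observedType
  };
}

const readX = recordPropertyRead({ x: 1 }, "x", 7);
console.log(readX({ x: 41 }) + 1);
```

Code following the guard can rely on the observed type without further checks, which is what lets the compiled trace omit type dispatch.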
Representation specialization: objects. In JavaScript, name lookup semantics are complex and potentially expensive because they include features like object inheritance and eval. To evaluate an object property read expression like o.x, the interpreter must search the property map of o and all of its prototypes and parents. Property maps can be implemented with different data structures (e.g., per-object hash tables or shared hash tables), so the search process also must dispatch on the representation of each object found during search. TraceMonkey can simply observe the result of the search process and record the simplest possible LIR to access the property value. For example, the search might find the value of o.x in the prototype of o, which uses a shared hash-table representation that places x in slot 2 of a property vector. Then the recorder can generate LIR that reads o.x with just two or three loads: one to get the prototype, possibly one to get the property value vector, and one more to get slot 2 from the vector. This is a vast simplification and speedup compared to the original interpreter code. Inheritance relationships and object representations can change during execution, so the simplified code requires guard instructions that ensure the object representation is the same. In TraceMonkey, objects' representations are assigned an integer key called the object shape. Thus, the guard is a simple equality check on the object shape.
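The o.x example can be sketched with a toy object model. Everything here is an assumption made for illustration: objects are plain records with a `shape` key, a `slots` vector and a `proto` link, and `shapeTable` maps a shape to the slot index of each property. The real SpiderMonkey data structures are not shown in the text.

```javascript
// Hypothetical object model for this sketch: an integer shape key, a slot
// vector, and a prototype link. None of these names come from TraceMonkey.
const shapeTable = new Map([
  [1, new Map()],           // shape 1: no own properties
  [2, new Map([["x", 2]])], // shape 2: property x lives in slot 2
]);

function makeObject(shape, slots, proto = null) {
  return { shape, slots, proto };
}

// Perform the generic lookup once, then return a specialized read that only
// guards on shapes and performs the loads observed during recording.
function recordPropertyLookup(obj, name) {
  const shapes = []; // shapes seen from obj up to the holder of the property
  for (let o = obj; o !== null; o = o.proto) {
    shapes.push(o.shape);
    const slot = shapeTable.get(o.shape).get(name);
    if (slot === undefined) continue;
    return function onTrace(target) {
      let cur = target;
      for (let i = 0; i < shapes.length; i++) {
        // Guard: a simple equality check on the object shape.
        if (cur === null || cur.shape !== shapes[i]) return { exit: true };
        if (i < shapes.length - 1) cur = cur.proto; // load the prototype
      }
      return { exit: false, value: cur.slots[slot] }; // load the slot
    };
  }
  return null; // property not found during recording
}

const proto = makeObject(2, [0, 0, 42]);
const o = makeObject(1, [], proto);
const readX = recordPropertyLookup(o, "x");
console.log(readX(o));
```

In this sketch the hash-table search happens only during recording; the returned closure performs just the shape comparisons and the loads, mirroring the two or three loads described above.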
Representation specialization: numbers. JavaScript has no integer type, only a Number type that is the set of 64-bit IEEE-754 floating-point numbers ("doubles"). But many JavaScript operators, in particular array accesses and bitwise operators, really operate on integers, so they first convert the number to an integer, and then convert any integer result back to a double.¹ Clearly, a JavaScript VM that wants to be fast must find a way to operate on integers directly and avoid these conversions.
In TraceMonkey, we support two representations for numbers: integers and doubles. The interpreter uses integer representations as much as it can, switching for results that can only be represented as doubles. When a trace is started, some values may be imported and represented as integers. Some operations on integers require guards. For example, adding two integers can produce a value too large for the integer representation.
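An overflow guard on integer addition might look like the sketch below. It assumes a 32-bit signed integer representation, which the text does not state, and `guardedIntAdd` is a hypothetical name.

```javascript
// Assumption of this sketch: integers on trace are 32-bit signed values.
const INT_MIN = -(2 ** 31);
const INT_MAX = 2 ** 31 - 1;

// Add two values that are known to be in integer representation.
// Returns { exit: true } when the result no longer fits, which models
// the guard failing and the trace taking a side exit.
function guardedIntAdd(a, b) {
  const sum = a + b; // exact: doubles represent all sums of two 32-bit ints
  if (sum < INT_MIN || sum > INT_MAX) return { exit: true };
  return { exit: false, value: sum };
}

console.log(guardedIntAdd(1, 2));
console.log(guardedIntAdd(INT_MAX, 1));
```

In native code the same check is typically a single overflow-flag test after the add instruction, so the guard costs far less than a conversion to double and back.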
Function inlining. LIR traces can cross function boundaries in either direction, achieving function inlining. Move instructions need to be recorded for function entry and exit to copy arguments in and return values out. These move statements are then optimized away by the compiler using copy propagation. In order to be able to return to the interpreter, the trace must also generate LIR to record that a call frame has been entered and exited. The frame entry and exit LIR saves just enough information to allow the interpreter call stack to be restored later and is much simpler than the interpreter's standard call code. If the function being entered is not constant (which in JavaScript includes any call by function name), the recorder must also emit LIR to guard that the function is the same.
Guards and side exits. Each optimization described above requires one or more guards to verify the assumptions made in doing the optimization. A guard is just a group of LIR instructions that performs a test and conditional exit. The exit branches to a side exit, a small off-trace piece of LIR that returns a pointer to a structure that describes the reason for the exit along with the interpreter PC at the exit point and any other data needed to restore the interpreter's state structures.
Aborts. Some constructs are difficult to record in LIR traces. For example, eval or calls to external functions can change the program state in unpredictable ways, making it difficult for the tracer to know the current type map in order to continue tracing. A tracing implementation can also have any number of other limitations, e.g., a small-memory device may limit the length of traces. When any situation occurs that prevents the implementation from continuing trace recording, the implementation aborts trace recording and returns to the trace monitor.
3.2 Trace Trees
Especially simple loops, namely those where control flow, value types, value representations, and inlined functions are all invariant, can be represented by a single trace. But most loops have at least some variation, and so the program will take side exits from the main trace. When a side exit becomes hot, TraceMonkey starts a new branch trace from that point and patches the side exit to jump directly to that trace. In this way, a single trace expands on demand to a single-entry, multiple-exit trace tree.
This section explains how trace trees are formed during execution. The goal is to form trace trees during execution that cover all the hot paths of the program.
¹ Arrays are actually worse than this: if the index value is a number, it must be converted from a double to a string for the property access operator, and then to an integer internally to the array implementation.
Starting a tree. Trace trees always start at loop headers, because they are a natural place to look for hot paths. In TraceMonkey, loop headers are easy to detect: the bytecode compiler ensures that a bytecode is a loop header iff it is the target of a backward branch. TraceMonkey starts a tree when a given loop header has been executed a certain number of times (2 in the current implementation). Starting a tree just means starting recording a trace for the current point and type map and marking the trace as the root of a tree. Each tree is associated with a loop header and type map, so there may be several trees for a given loop header.
Closing the loop. Trace recording can end in several ways. Ideally, the trace reaches the loop header where it started with the same type map as on entry. This is called a type-stable loop iteration. In this case, the end of the trace can jump right to the beginning, as all the value representations are exactly as needed to enter the trace. The jump can even skip the usual code that would copy out the state at the end of the trace and copy it back into the trace activation record to enter a trace.
In certain cases the trace might reach the loop header with a different type map. This scenario is sometimes observed for the first iteration of a loop. Some variables inside the loop might initially be undefined, before they are set to a concrete type during the first loop iteration. When recording such an iteration, the recorder cannot link the trace back to its own loop header since it is type-unstable. Instead, the iteration is terminated with a side exit that will always fail and return to the interpreter. At the same time a new trace is recorded with the new type map. Every time an additional type-unstable trace is added to a region, its exit type map is compared to the entry map of all existing traces in case they complement each other. With this approach we are able to cover type-unstable loop iterations as long as they eventually form a stable equilibrium.
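The comparison of exit and entry type maps can be sketched as follows. The trace records and the helper names `sameTypeMap` and `linkTraces` are hypothetical; the sketch only shows the matching step and assumes all traces belong to the same loop header.

```javascript
// Two type maps match if they name the same variables with the same types.
function sameTypeMap(a, b) {
  const keys = Object.keys(a);
  return keys.length === Object.keys(b).length && keys.every((k) => a[k] === b[k]);
}

// For every trace, find a trace whose entry type map equals this trace's
// exit type map, so that the loop edge can jump there directly.
function linkTraces(traces) {
  for (const t of traces) {
    const target = traces.find((u) => sameTypeMap(t.exit, u.entry));
    t.next = target ? target.name : null; // null: fall back to the interpreter
  }
  return traces;
}

// First iteration: v is undefined on entry and an int at the loop edge.
// Later iterations: v is an int on entry and at the loop edge.
const traces = linkTraces([
  { name: "A", entry: { v: "undefined" }, exit: { v: "int" } },
  { name: "B", entry: { v: "int" }, exit: { v: "int" } },
]);
console.log(traces.map((t) => `${t.name} -> ${t.next}`).join(", "));
```

Here trace A is type-unstable on its own but its loop edge can be connected to B, and B is type-stable, so the pair reaches the stable equilibrium the text refers to.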
Finally, the trace might exit the loop before reaching the loop header, for example because execution reaches a break or return statement. In this case, the VM simply ends the trace with an exit to the trace monitor.
As mentioned previously, we may speculatively choose to represent certain Number-typed values as integers on trace. We do so when we observe that Number-typed variables contain an integer value at trace entry. If during trace recording the variable is unexpectedly assigned a non-integer value, we have to widen the type of the variable to a double. As a result, the recorded trace becomes inherently type-unstable since it starts with an integer value but ends with a double value. This represents a mis-speculation, since at trace entry we specialized the Number-typed value to an integer, assuming that at the loop edge we would again find an integer value in the variable, allowing us to close the loop. To avoid future speculative failures involving this variable, and to obtain a type-stable trace, we note the fact that the variable in question has been observed to sometimes hold non-integer values in an advisory data structure which we call the "oracle".
When compiling loops, we consult the oracle before specializing values to integers. Speculation towards integers is performed only if no adverse information is known to the oracle about that particular variable. Whenever we accidentally compile a loop that is type-unstable due to mis-speculation of a Number-typed variable, we immediately trigger the recording of a new trace, which based on the now updated oracle information will start with a double value and thus become type stable.
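A minimal sketch of such an oracle follows. The class `Oracle`, its methods, and the use of variable names as keys are assumptions for illustration; the text only says that the oracle is an advisory data structure consulted before specializing to integers.

```javascript
// Advisory record of variables that were seen holding non-integer values.
class Oracle {
  constructor() {
    this.notInt = new Set();
  }
  markNotInt(variable) {
    this.notInt.add(variable);
  }
  mayBeInt(variable) {
    return !this.notInt.has(variable);
  }
}

// Pick the on-trace representation of a Number-typed variable at trace entry.
function chooseRepresentation(oracle, variable, value) {
  return Number.isInteger(value) && oracle.mayBeInt(variable) ? "int" : "double";
}

const oracle = new Oracle();
console.log(chooseRepresentation(oracle, "sum", 0)); // speculate on integers
oracle.markNotInt("sum");                            // mis-speculation observed
console.log(chooseRepresentation(oracle, "sum", 0)); // now start as a double
```

Because the oracle is only advisory, a wrong entry costs performance (a double where an integer would have worked) but never correctness.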
Extending a tree. Side exits lead to different paths through the loop, or paths with different types or representations. Thus, to completely cover the loop, the VM must record traces starting at all side exits. These traces are recorded much like root traces: there is a counter for each side exit, and when the counter reaches a hotness threshold, recording starts. Recording stops exactly as for the root trace, using the loop header of the root trace as the target to reach.
Our implementation does not extend at all side exits. It extends only if the side exit is for a control-flow branch, and only if the side exit does not leave the loop. In particular we do not want to extend a trace tree along a path that leads to an outer loop, because we want to cover such paths in an outer tree through tree nesting.
3.3 Blacklisting
Sometimes, a program follows a path that cannot be compiled into a trace, usually because of limitations in the implementation. TraceMonkey does not currently support recording throwing and catching of arbitrary exceptions. This design tradeoff was chosen because exceptions are usually rare in JavaScript. However, if a program opts to use exceptions intensively, we would suddenly incur a punishing runtime overhead if we repeatedly try to record a trace for this path and repeatedly fail to do so, since we abort tracing every time we observe an exception being thrown.
As a result, if a hot loop contains traces that always fail, the VM could potentially run much more slowly than the base interpreter: the VM repeatedly spends time trying to record traces, but is never able to run any. To avoid this problem, whenever the VM is about to start tracing, it must try to predict whether it will finish the trace.
Our prediction algorithm is based on blacklisting traces that have been tried and failed. When the VM fails to finish a trace starting at a given point, the VM records that a failure has occurred. The VM also sets a counter so that it will not try to record a trace starting at that point until it is passed a few more times (32 in our implementation). This backoff counter gives temporary conditions that prevent tracing a chance to end. For example, a loop may behave differently during startup than during its steady-state execution. After a given number of failures (2 in our implementation), the VM marks the fragment as blacklisted, which means the VM will never again start recording at that point.
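The backoff and blacklisting policy can be sketched as a small state machine. The class `Fragment` and its method names are hypothetical; the constants 32 and 2 are the values given in the text, and the exact point at which the real counter is decremented is an assumption of this sketch.

```javascript
const BACKOFF = 32;     // passes to skip after a failed recording (from the text)
const MAX_FAILURES = 2; // failures after which the fragment is blacklisted

class Fragment {
  constructor() {
    this.failures = 0;
    this.backoff = 0;
    this.blacklisted = false;
  }

  // Called each time execution passes this point while it is hot.
  // Returns true if the VM should try to record a trace now.
  shouldRecord() {
    if (this.blacklisted) return false;
    if (this.backoff > 0) {
      this.backoff--; // still backing off: let temporary conditions pass
      return false;
    }
    return true;
  }

  // Called when a recording started at this point was aborted.
  onRecordingFailed() {
    this.failures++;
    if (this.failures >= MAX_FAILURES) this.blacklisted = true;
    else this.backoff = BACKOFF;
  }
}

const fragment = new Fragment();
fragment.onRecordingFailed();
let skipped = 0;
while (!fragment.shouldRecord()) skipped++;
console.log(skipped);
```

The bytecode patching described in the next paragraph goes one step further than this sketch: once a fragment is blacklisted, the interpreter no longer reaches a check like `shouldRecord` at all.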
After implementing this basic strategy, we observed that for small loops that get blacklisted, the system can spend a noticeable amount of time just finding the loop fragment and determining that it has been blacklisted. We now avoid that problem by patching the bytecode. We define an extra no-op bytecode that indicates a loop header. The VM calls into the trace monitor every time the interpreter executes a loop header no-op. To blacklist a fragment, we simply replace the loop header no-op with a regular no-op. Thus, the interpreter will never again even call into the trace monitor.
There is a related problem we have not yet solved, which occurs when a loop meets all of these conditions:
• The VM can form at least one root trace for the loop.
• There is at least one hot side exit for which the VM cannot complete a trace.
• The loop body is short.
In this case, the VM will repeatedly pass the loop header, search for a trace, find it, execute it, and fall back to the interpreter. With a short loop body, the overhead of finding and calling the trace is high, and causes performance to be even slower than the basic interpreter. So far, in this situation we have improved the implementation so that the VM can complete the branch trace. But it is hard to guarantee that this situation will never happen. As future work, this situation could be avoided by detecting and blacklisting loops for which the average trace call executes few bytecodes before returning to the interpreter.
4. Nested Trace Tree Formation
Figure 7 shows basic trace tree compilation (11) applied to a nested loop where the inner loop contains two paths. Usually, the inner loop (with header at i2) becomes hot first, and a trace tree is rooted at that point. For example, the first recorded trace may be a cycle
Figure 5. A tree with two traces, a trunk trace and one branch trace. The trunk trace contains a guard to which a branch trace was attached. The branch trace contains a guard that may fail and trigger a side exit. Both the trunk and the branch trace loop back to the tree anchor, which is the beginning of the trace tree.
Figure 6. We handle type-unstable loops by allowing traces to compile that cannot loop back to themselves due to a type mismatch. As such traces accumulate, we attempt to connect their loop edges to form groups of trace trees that can execute without having to side-exit to the interpreter to cover odd type cases. This is particularly important for nested trace trees where an outer tree tries to call an inner tree (or in this case a forest of inner trees), since inner loops frequently have initially undefined values which change type to a concrete value after the first iteration.
through the inner loop, {i2, i3, i5, α}. The α symbol is used to indicate that the trace loops back to the tree anchor.
When execution leaves the inner loop, the basic design has two choices. First, the system can stop tracing and give up on compiling the outer loop, clearly an undesirable solution. The other choice is to continue tracing, compiling traces for the outer loop inside the inner loop's trace tree.
For example, the program might exit at i5 and record a branch trace that incorporates the outer loop: {i5, i7, i1, i6, i7, i1, α}. Later, the program might take the other branch at i2 and then exit, recording another branch trace incorporating the outer loop: {i2, i4, i5, i7, i1, i6, i7, i1, α}. Thus, the outer loop is recorded and compiled twice, and both copies must be retained in the trace cache.
Figure 7. Control flow graph of a nested loop with an if statement inside the innermost loop (a). An inner tree captures the inner loop, and is nested inside an outer tree which "calls" the inner tree. The inner tree returns to the outer tree once it exits along its loop condition guard (b).
In general, if loops are nested to depth k, and each loop has n paths (on geometric average), this naïve strategy yields O(n^k) traces, which can easily fill the trace cache.
In order to execute programs with nested loops efficiently, a tracing system needs a technique for covering the nested loops with native code without exponential trace duplication.
4.1NestingAlgorithm
Thekeyinsightisthatifeachloopisrepresentedbyitsowntrace tree,thecodeforeachloopcanbecontainedonlyinitsowntree, andouterlooppathswillnotbeduplicated.Anotherkeyfactisthat wearenottracingarbitrarybytecodesthatmighthaveirreduceable controlflowgraphs,butratherbytecodesproducedbyacompiler foralanguagewithstructuredcontrolflow.Thus,giventwoloop edges,thesystemcaneasilydeterminewhethertheyarenested andwhichistheinnerloop.Usingthisknowledge,thesystemcan compileinnerandouterloopsseparately,andmaketheouterloop’s traces call theinnerloop’stracetree.
The algorithm for building nested trace trees is as follows. We start tracing at loop headers exactly as in the basic tracing system. When we exit a loop (detected by comparing the interpreter PC with the range given by the loop edge), we stop the trace. The key step of the algorithm occurs when we are recording a trace for loop L_R (R for loop being recorded) and we reach the header of a different loop L_O (O for other loop). Note that L_O must be an inner loop of L_R because we stop the trace when we exit a loop.
• If L_O has a type-matching compiled trace tree, we call L_O as a nested trace tree. If the call succeeds, then we record the call in the trace for L_R. On future executions, the trace for L_R will call the inner trace directly.
• If L_O does not have a type-matching compiled trace tree yet, we have to obtain it before we are able to proceed. In order to do this, we simply abort recording the first trace. The trace monitor will see the inner loop header, and will immediately start recording the inner loop.²
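The decision taken at an inner loop header can be sketched as follows. This is a minimal illustration and not the TraceMonkey code: the names `Recorder`, `on_inner_loop_header`, and the dictionary-based trace cache are hypothetical, and the sketch only records the call without actually executing the inner tree first.

```python
from dataclasses import dataclass, field


@dataclass
class Recorder:
    """Hypothetical recorder state for the outer loop being recorded."""
    loop: str
    ops: list = field(default_factory=list)
    aborted: bool = False


def on_inner_loop_header(recorder, inner_loop, type_map, trace_cache):
    """Decide what to do when recording reaches the header of an inner loop."""
    tree = trace_cache.get((inner_loop, type_map))
    if tree is None:
        # No type-matching tree yet: abort the outer recording so that the
        # monitor can record the inner loop first.
        recorder.aborted = True
        return "abort-outer"
    # A type-matching tree exists: record a call to it in the outer trace.
    recorder.ops.append(("call_tree", inner_loop, type_map))
    return "call-inner"
```

Keying the cache on both the loop and the type map mirrors the requirement that the inner tree be type-matching, not merely present.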
If all the loops in a nest are type-stable, then loop nesting creates no duplication. Otherwise, if loops are nested to a depth k, and each
² Instead of aborting the outer recording, we could in principle merely suspend the recording, but that would require the implementation to be able to record several traces simultaneously, complicating the implementation, while saving only a few iterations in the interpreter.
Figure 8. Control flow graph of a loop with two nested loops (left) and its nested trace tree configuration (right). The outer tree calls the two inner nested trace trees and places guards at their side exit locations.
loop is entered with m different type maps (on geometric average), then we compile O(m^k) copies of the innermost loop. As long as m is close to 1, the resulting trace trees will be tractable.
An important detail is that the call to the inner trace tree must act like a function call site: it must return to the same point every time. The goal of nesting is to make inner and outer loops independent; thus when the inner tree is called, it must exit to the same point in the outer tree every time with the same type map. Because we cannot actually guarantee this property, we must guard on it after the call, and side exit if the property does not hold. A common reason for the inner tree not to return to the same point would be if the inner tree took a new side exit for which it had never compiled a trace. At this point, the interpreter PC is in the inner tree, so we cannot continue recording or executing the outer tree. If this happens during recording, we abort the outer trace, to give the inner tree a chance to finish growing. A future execution of the outer tree would then be able to properly finish and record a call to the inner tree. If an inner tree side exit happens during execution of a compiled trace for the outer tree, we simply exit the outer trace and start recording a new branch in the inner tree.
4.2 Blacklisting with Nesting
The blacklisting algorithm needs modification to work well with nesting. The problem is that outer loop traces often abort during startup (because the inner tree is not available or takes a side exit), which would lead to their being quickly blacklisted by the basic algorithm.
The key observation is that when an outer trace aborts because the inner tree is not ready, this is probably a temporary condition. Thus, we should not count such aborts toward blacklisting as long as we are able to build up more traces for the inner tree.
In our implementation, when an outer tree aborts on the inner tree, we increment the outer tree’s blacklist counter as usual and back off on compiling it. When the inner tree finishes a trace, we decrement the blacklist counter on the outer loop, “forgiving” the outer loop for aborting previously. We also undo the backoff so that the outer tree can start immediately trying to compile the next time we reach it.
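The counter handling described above can be sketched as a small piece of per-loop bookkeeping. The class name `LoopState`, the blacklist limit of 2, and the backoff of 32 loop-edge hits are assumptions chosen for illustration; the text does not give the actual thresholds.

```python
class LoopState:
    """Hypothetical per-loop bookkeeping; limit and backoff are assumed values."""

    def __init__(self, blacklist_limit=2, backoff=32):
        self.failures = 0                # the blacklist counter
        self.skip = 0                    # loop-edge hits to ignore before retrying
        self.blacklist_limit = blacklist_limit
        self.backoff = backoff

    @property
    def blacklisted(self):
        return self.failures >= self.blacklist_limit

    def on_abort(self):
        """The outer recording aborted: count it and back off."""
        self.failures += 1
        self.skip = self.backoff

    def forgive(self):
        """The inner tree finished a trace: undo one abort and the backoff."""
        self.failures = max(0, self.failures - 1)
        self.skip = 0
```

In this sketch an outer loop that aborts once and is then forgiven returns to a clean state, whereas repeated aborts without inner-tree progress still reach the blacklist limit.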
5. Trace Tree Optimization
This section explains how a recorded trace is translated to an optimized machine code trace. The trace compilation subsystem, NANOJIT, is separate from the VM and can be used for other applications.
5.1 Optimizations
Because traces are in SSA form and have no join points or φ-nodes, certain optimizations are easy to implement. In order to get good startup performance, the optimizations must run quickly, so we chose a small set of optimizations. We implemented the optimizations as pipelined filters so that they can be turned on and off independently, and yet all run in just two loop passes over the trace: one forward and one backward.
Every time the trace recorder emits a LIR instruction, the instruction is immediately passed to the first filter in the forward pipeline. Thus, forward filter optimizations are performed as the trace is recorded. Each filter may pass each instruction to the next filter unchanged, write a different instruction to the next filter, or write no instruction at all. For example, the constant folding filter can replace a multiply instruction like v13 = mul 3, 1000 with a constant instruction v13 = 3000.
We currently apply four forward filters:
• On ISAs without floating-point instructions, a soft-float filter converts floating-point LIR instructions to sequences of integer instructions,
• CSE (common subexpression elimination),
• expression simplification, including constant folding and a few algebraic identities (e.g., a − a = 0), and
• source language semantic-specific expression simplification, primarily algebraic identities that allow DOUBLE to be replaced with INT. For example, LIR that converts an INT to a DOUBLE and then back again would be removed by this filter.
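A forward filter pipeline of this kind can be sketched as a chain of objects that each receive an instruction and decide whether to forward it unchanged, forward a different one, or forward nothing. The tuple encoding `(dst, op, *args)`, the class names, and the filter order are assumptions made for this illustration and are not nanojit's interfaces; all operations in the sketch are assumed to be pure.

```python
class Sink:
    """End of the pipeline: collects the instructions that survive."""

    def __init__(self):
        self.out = []

    def emit(self, ins):
        self.out.append(ins)
        return ins[0]          # name of the value holding the result


class ConstFold:
    """Replaces a `mul` of two integer literals with a `const`."""

    def __init__(self, nxt):
        self.nxt = nxt

    def emit(self, ins):
        dst, op, *args = ins
        if op == "mul" and all(isinstance(a, int) for a in args):
            ins = (dst, "const", args[0] * args[1])   # write a different instruction
        return self.nxt.emit(ins)


class CSE:
    """Common subexpression elimination over pure operations."""

    def __init__(self, nxt):
        self.nxt = nxt
        self.seen = {}

    def emit(self, ins):
        dst, op, *args = ins
        key = (op, *args)
        if key in self.seen:
            return self.seen[key]                     # write no instruction at all
        self.seen[key] = dst
        return self.nxt.emit(ins)
```

A recorder would call `emit` on the first filter for every instruction and use the returned name as the operand in later instructions, so that an eliminated duplicate is transparently replaced by the earlier value.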
When trace recording is completed, nanojit runs the backward optimization filters. These are used for optimizations that require backward program analysis. When running the backward filters, nanojit reads one LIR instruction at a time, and the reads are passed through the pipeline.
We currently apply three backward filters:
• Dead data-stack store elimination. The LIR trace encodes many stores to locations in the interpreter stack. But these values are never read back before exiting the trace (by the interpreter or another trace). Thus, stores to the stack that are overwritten before the next exit are dead. Stores to locations that are off the top of the interpreter stack at future exits are also dead.
• Dead call-stack store elimination. This is the same optimization as above, except applied to the interpreter’s call stack used for function call inlining.
• Dead code elimination. This eliminates any operation that stores to a value that is never used.
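The first backward filter can be illustrated with a single reverse pass over a toy trace. The instruction encoding (`("store", slot, value)` and `("exit",)`) and the function name are hypothetical; the sketch models only the "overwritten before the next exit" rule and ignores the rule about slots that lie above the top of the stack at future exits.

```python
def eliminate_dead_stack_stores(trace):
    """Backward pass: drop stores that are overwritten before the next exit."""
    overwritten = set()       # slots stored again later, before any exit
    kept = []
    for ins in reversed(trace):
        op = ins[0]
        if op == "exit":
            overwritten.clear()          # at an exit, every slot may be read
        elif op == "store":
            slot = ins[1]
            if slot in overwritten:
                continue                 # dead: a later store wins before any exit
            overwritten.add(slot)
        kept.append(ins)
    kept.reverse()
    return kept
```

Walking the trace backward means each store is judged with full knowledge of what follows it, which is why this analysis fits naturally in the backward pipeline.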
After a LIR instruction is successfully read (“pulled”) from the backward filter pipeline, nanojit’s code generator emits native machine instruction(s) for it.
5.2 Register Allocation
We use a simple greedy register allocator that makes a single backward pass over the trace (it is integrated with the code generator). By the time the allocator has reached an instruction like v3 = add v1, v2, it has already assigned a register to v3. If v1 and v2 have not yet been assigned registers, the allocator assigns a free register to each. If there are no free registers, a value is selected for spilling. We use a classic heuristic that selects the “oldest” register-carried value (6).
The heuristic considers the set R of values v in registers immediately after the current instruction for spilling. Let v_m be the last instruction before the current where each v is referred to. Then the
Tag   JS Type                       Description
xx1   number                        31-bit integer representation
000   object                        pointer to JSObject handle
010   number                        pointer to double handle
100   string                        pointer to JSString handle
110   boolean, null, or undefined   enumeration for null, undefined, true, false
Figure 9. Tagged values in the SpiderMonkey JS interpreter. Testing tags, unboxing (extracting the untagged value) and boxing (creating tagged values) are significant costs. Avoiding these costs is a key benefit of tracing.
heuristic selects v with minimum v_m. The motivation is that this frees up a register for as long as possible given a single spill.
If we need to spill a value v_s at this point, we generate the restore code just after the code for the current instruction. The corresponding spill code is generated just after the last point where v_s was used. The register that was assigned to v_s is marked free for the preceding code, because that register can now be used freely without affecting the following code.
6. Implementation
To demonstrate the effectiveness of our approach, we have implemented a trace-based dynamic compiler for the SpiderMonkey JavaScript Virtual Machine (4). SpiderMonkey is the JavaScript VM embedded in Mozilla’s Firefox open-source web browser (2), which is used by more than 200 million users world-wide. The core of SpiderMonkey is a bytecode interpreter implemented in C++.
In SpiderMonkey, all JavaScript values are represented by the type jsval. A jsval is a machine word in which up to 3 of the least significant bits are a type tag, and the remaining bits are data. See Figure 9 for details. All pointers contained in jsvals point to GC-controlled blocks aligned on 8-byte boundaries.
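The tagging scheme can be illustrated with plain integer arithmetic. The sketch assumes a 32-bit word and assumes that a 31-bit integer is boxed by shifting it left by one bit and setting the low bit, which is consistent with the xx1 tag but is not spelled out in the text; the helper names are hypothetical.

```python
TAG_OBJECT, TAG_DOUBLE, TAG_STRING, TAG_BOOLEAN = 0b000, 0b010, 0b100, 0b110


def box_int(n):
    """Box a 31-bit signed integer: shift left by one and set the low bit."""
    assert -(1 << 30) <= n < (1 << 30)
    return ((n << 1) | 1) & 0xFFFFFFFF


def is_int(v):
    return v & 1 == 1


def unbox_int(v):
    n = v >> 1
    # Sign-extend from 31 bits.
    return n - (1 << 31) if n & (1 << 30) else n


def box_pointer(addr, tag):
    """Tag an 8-byte-aligned address; alignment leaves the low 3 bits free."""
    assert addr % 8 == 0
    return addr | tag


def tag_of(v):
    return v & 0b111           # meaningful only when is_int(v) is false


def unbox_pointer(v):
    return v & ~0b111
```

Every arithmetic operation in the interpreter pays for checks and shifts of this kind, which is the cost that type-specialized traces avoid.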
JavaScript object values are mappings of string-valued property names to arbitrary values. They are represented in one of two ways in SpiderMonkey. Most objects are represented by a shared structural description, called the object shape, that maps property names to array indexes using a hash table. The object stores a pointer to the shape and the array of its own property values. Objects with large, unique sets of property names store their properties directly in a hash table.
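The shared-shape representation can be sketched in a few lines. The class names `Shape` and `ShapedObject` are hypothetical, and the sketch leaves out the second representation (a per-object hash table) as well as property addition and deletion.

```python
class Shape:
    """Shared structural description: maps property names to array indexes."""

    def __init__(self, names):
        self.index = {name: i for i, name in enumerate(names)}


class ShapedObject:
    """An object holds a pointer to its shape plus its own property values."""

    def __init__(self, shape, values):
        self.shape = shape
        self.slots = list(values)

    def get(self, name):
        return self.slots[self.shape.index[name]]
```

Two objects with the same property names share one `Shape`, so only the value arrays differ between them.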
The garbage collector is an exact, non-generational, stop-the-world mark-and-sweep collector.
In the rest of this section we discuss key areas of the TraceMonkey implementation.
6.1 Calling Compiled Traces
Compiled traces are stored in a trace cache, indexed by interpreter PC and type map. Traces are compiled so that they may be called as functions using standard native calling conventions (e.g., FASTCALL on x86).
The interpreter must hit a loop edge and enter the monitor in order to call a native trace for the first time. The monitor computes the current type map, checks the trace cache for a trace for the current PC and type map, and if it finds one, executes the trace.
To execute a trace, the monitor must build a trace activation record containing imported local and global variables, temporary stack space, and space for arguments to native calls. The local and global values are then copied from the interpreter state to the trace activation record. Then, the trace is called like a normal C function pointer.
When a trace call returns, the monitor restores the interpreter state. First, the monitor checks the reason for the trace exit and applies blacklisting if needed. Then, it pops or synthesizes interpreter JavaScript call stack frames as needed. Finally, it copies the imported variables back from the trace activation record to the interpreter state.
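The lookup-and-call sequence can be sketched as follows. The function names, the dictionary used as a trace cache, and the use of a Python callable as a stand-in for a compiled trace are all assumptions of this illustration; blacklisting and call-stack reconstruction on exit are omitted.

```python
def type_of(value):
    """Hypothetical type classification used to build a type map."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "object"


def monitor_loop_edge(pc, interp_vars, trace_cache):
    """At a loop edge: look up a trace by (PC, type map) and run it if found."""
    type_map = tuple(type_of(interp_vars[name]) for name in sorted(interp_vars))
    trace = trace_cache.get((pc, type_map))
    if trace is None:
        return False                     # stay in the interpreter
    activation = dict(interp_vars)       # import variables into the activation record
    trace(activation)                    # call the trace like a function
    interp_vars.update(activation)       # copy imported variables back
    return True
```

The copying on entry and exit is the per-transition cost discussed next, which is why keeping execution on trace matters.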
At least in the current implementation, these steps have a non-negligible runtime cost, so minimizing the number of interpreter-to-trace and trace-to-interpreter transitions is essential for performance (see also Section 3.3). Our experiments (see Figure 12) show that for programs we can trace well such transitions happen infrequently and hence do not contribute significantly to total runtime. In a few programs, where the system is prevented from recording branch traces for hot side exits by aborts, this cost can rise to up to 10% of total execution time.
6.2 Trace Stitching
Transitions from a trace to a branch trace at a side exit avoid the costs of calling traces from the monitor, in a feature called trace stitching. At a side exit, the exiting trace only needs to write live register-carried values back to its trace activation record. In our implementation, identical type maps yield identical activation record layouts, so the trace activation record can be reused immediately by the branch trace.
In programs with branchy trace trees with small traces, trace stitching has a noticeable cost. Although writing to memory and then soon reading back would be expected to have a high L1 cache hit rate, for small traces the increased instruction count has a noticeable cost. Also, if the writes and reads are very close in the dynamic instruction stream, we have found that current x86 processors often incur penalties of 6 cycles or more (e.g., if the instructions use different base registers with equal values, the processor may not be able to detect that the addresses are the same right away).
The alternate solution is to recompile an entire trace tree, thus achieving inter-trace register allocation (10). The disadvantage is that tree recompilation takes time quadratic in the number of traces. We believe that the cost of recompiling a trace tree every time a branch is added would be prohibitive. That problem might be mitigated by recompiling only at certain points, or only for very hot, stable trees.
In the future, multicore hardware is expected to be common, making background tree recompilation attractive. In a closely related project (13) background recompilation yielded speedups of up to 1.25x on benchmarks with many branch traces. We plan to apply this technique to TraceMonkey as future work.
6.3 Trace Recording
The job of the trace recorder is to emit LIR with identical semantics to the currently running interpreter bytecode trace. A good implementation should have low impact on non-tracing interpreter performance and a convenient way for implementers to maintain semantic equivalence.
In our implementation, the only direct modification to the interpreter is a call to the trace monitor at loop edges. In our benchmark results (see Figure 12) the total time spent in the monitor (for all activities) is usually less than 5%, so we consider the interpreter impact requirement met. Incrementing the loop hit counter is expensive because it requires us to look up the loop in the trace cache, but we have tuned our loops to become hot and trace very quickly (on the second iteration). The hit counter implementation could be improved, which might give us a small increase in overall performance, as well as more flexibility with tuning hotness thresholds. Once a loop is blacklisted we never call into the trace monitor for that loop (see Section 3.3).
Recording is activated by a pointer swap that sets the interpreter’s dispatch table to call a single “interrupt” routine for every bytecode. The interrupt routine first calls a bytecode-specific recording routine. Then, it turns off recording if necessary (e.g., the trace ended). Finally, it jumps to the standard interpreter bytecode implementation. Some bytecodes have effects on the type map that cannot be predicted before executing the bytecode (e.g., calling String.charCodeAt, which returns an integer or NaN if the index argument is out of range). For these, we arrange for the interpreter to call into the recorder again after executing the bytecode. Since such hooks are relatively rare, we embed them directly into the interpreter, with an additional runtime check to see whether a recorder is currently active.
While separating the interpreter from the recorder reduces individual code complexity, it also requires careful implementation and extensive testing to achieve semantic equivalence.
In some cases achieving this equivalence is difficult since SpiderMonkey follows a fat-bytecode design, which was found to be beneficial to pure interpreter performance.
In fat-bytecode designs, individual bytecodes can implement complex processing (e.g., the getprop bytecode, which implements full JavaScript property value access, including special cases for cached and dense array access).
Fat bytecodes have two advantages: fewer bytecodes means lower dispatch cost, and bigger bytecode implementations give the compiler more opportunities to optimize the interpreter.
Fat bytecodes are a problem for TraceMonkey because they require the recorder to reimplement the same special case logic in the same way. Also, the advantages are reduced because (a) dispatch costs are eliminated entirely in compiled traces, (b) the traces contain only one special case, not the interpreter’s large chunk of code, and (c) TraceMonkey spends less time running the base interpreter.
One way we have mitigated these problems is by implementing certain complex bytecodes in the recorder as sequences of simple bytecodes. Expressing the original semantics this way is not too difficult, and recording simple bytecodes is much easier. This enables us to retain the advantages of fat bytecodes while avoiding some of their problems for trace recording. This is particularly effective for fat bytecodes that recurse back into the interpreter, for example to convert an object into a primitive value by invoking a well-known method on the object, since it lets us inline this function call.
It is important to note that we split fat opcodes into thinner opcodes only during recording. When running purely interpretatively (i.e., code that has been blacklisted), the interpreter directly and efficiently executes the fat opcodes.
6.4 Preemption
SpiderMonkey, like many VMs, needs to preempt the user program periodically. The main reasons are to prevent infinitely looping scripts from locking up the host system and to schedule GC.
In the interpreter, this had been implemented by setting a “preempt now” flag that was checked on every backward jump. This strategy carried over into TraceMonkey: the VM inserts a guard on the preemption flag at every loop edge. We measured less than a 1% increase in runtime on most benchmarks for this extra guard. In practice, the cost is detectable only for programs with very short loops.
We tested and rejected a solution that avoided the guards by compiling the loop edge as an unconditional jump, and patching the jump target to an exit routine when preemption is required. This solution can make the normal case slightly faster, but then preemption becomes very slow. The implementation was also very complex, especially trying to restart execution after the preemption.
6.5 Calling External Functions
Like most interpreters, SpiderMonkey has a foreign function interface (FFI) that allows it to call C builtins and host system functions (e.g., web browser control and DOM access). The FFI has a standard signature for JS-callable functions, the key argument of which is an array of boxed values. External functions called through the FFI interact with the program state through an interpreter API (e.g., to read a property from an argument). There are also certain interpreter builtins that do not use the FFI, but interact with the program state in the same way, such as the CallIteratorNext function used with iterator objects. TraceMonkey must support this FFI in order to speed up code that interacts with the host system inside hot loops.
Calling external functions from TraceMonkey is potentially difficult because traces do not update the interpreter state until exiting. In particular, external functions may need the call stack or the global variables, but they may be out of date.
For the out-of-date call stack problem, we refactored some of the interpreter API implementation functions to re-materialize the interpreter call stack on demand.
We developed a C++ static analysis and annotated some interpreter functions in order to verify that the call stack is refreshed at any point it needs to be used. In order to access the call stack, a function must be annotated as either FORCESSTACK or REQUIRESSTACK. These annotations are also required in order to call REQUIRESSTACK functions, which are presumed to access the call stack transitively. FORCESSTACK is a trusted annotation, applied to only 5 functions, that means the function refreshes the call stack. REQUIRESSTACK is an untrusted annotation that means the function may only be called if the call stack has already been refreshed.
Similarly, we detect when host functions attempt to directly read or write global variables, and force the currently running trace to side exit. This is necessary since we cache and unbox global variables into the activation record during trace execution.
Since both call-stack access and global variable access are rarely performed by host functions, performance is not significantly affected by these safety mechanisms.
Another problem is that external functions can reenter the interpreter by calling scripts, which in turn again might want to access the call stack or global variables. To address this problem, we made the VM set a flag whenever the interpreter is reentered while a compiled trace is running.
Every call to an external function then checks this flag and exits the trace immediately after returning from the external function call if it is set. There are many external functions that seldom or never reenter, and they can be called without problem, and will cause trace exit only if necessary.
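The reentry check can be sketched with a flag on a toy VM object. The `VM` class, its field names, and `call_external` are hypothetical stand-ins; the sketch only shows where the flag is set and where it is tested.

```python
class VM:
    """Hypothetical VM state relevant to the reentry check."""

    def __init__(self):
        self.on_trace = False      # a compiled trace is currently running
        self.reentered = False     # the interpreter was reentered during it

    def run_script(self, fn):
        """Interpreter entry point; notes reentry while a trace is running."""
        if self.on_trace:
            self.reentered = True
        return fn()


def call_external(vm, fn, *args):
    """Call `fn` from a trace; returns (result, must_exit_trace)."""
    vm.reentered = False
    result = fn(vm, *args)
    # The flag is tested only after the external function has returned.
    return result, vm.reentered
```

An external function that never calls back into the interpreter leaves the flag clear, so the trace continues without a side exit.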
The FFI’s boxed value array requirement has a performance cost, so we defined a new FFI that allows C functions to be annotated with their argument types so that the tracer can call them directly, without unnecessary argument conversions.
Currently, we do not support calling native property get and set override functions or DOM functions directly from trace. Support is planned future work.
6.6 Correctness
During development, we had access to existing JavaScript test suites, but most of them were not designed with tracing VMs in mind and contained few loops.
One tool that helped us greatly was Mozilla’s JavaScript fuzz tester, JSFUNFUZZ, which generates random JavaScript programs by nesting random language elements. We modified JSFUNFUZZ to generate loops, and also to test more heavily certain constructs we suspected would reveal flaws in our implementation. For example, we suspected bugs in TraceMonkey’s handling of type-unstable
Figure 11. Fraction of dynamic bytecodes executed by interpreter and on native traces. The speedup vs. interpreter is shown in parentheses next to each test. The fraction of bytecodes executed while recording is too small to see in this figure, except for crypto-md5, where fully 3% of bytecodes are executed while recording. In most of the tests, almost all the bytecodes are executed by compiled traces. Three of the benchmarks are not traced at all and run in the interpreter.
loops and heavily branching code, and a specialized fuzz tester indeed revealed several regressions which we subsequently corrected.
7. Evaluation
We evaluated our JavaScript tracing implementation using SunSpider, the industry standard JavaScript benchmark suite. SunSpider consists of 26 short-running (less than 250ms, average 26ms) JavaScript programs. This is in stark contrast to benchmark suites such as SpecJVM98 (3) used to evaluate desktop and server Java VMs. Many programs in those benchmarks use large data sets and execute for minutes. The SunSpider programs carry out a variety of tasks, primarily 3d rendering, bit-bashing, cryptographic encoding, math kernels, and string processing.
All experiments were performed on a MacBook Pro with 2.2 GHz Core 2 processor and 2 GB RAM running MacOS 10.5.
Benchmark results. The main question is whether programs run faster with tracing. For this, we ran the standard SunSpider test driver, which starts a JavaScript interpreter, loads and runs each program once for warmup, then loads and runs each program 10 times and reports the average time taken by each. We ran 4 different configurations for comparison: (a) SpiderMonkey, the baseline interpreter, (b) TraceMonkey, (c) SquirrelFish Extreme (SFX), the call-threaded JavaScript interpreter used in Apple’s WebKit, and (d) V8, the method-compiling JavaScript VM from Google.
Figure 10 shows the relative speedups achieved by tracing, SFX, and V8 against the baseline (SpiderMonkey). Tracing achieves the best speedups in integer-heavy benchmarks, up to the 25x speedup on bitops-bitwise-and.
TraceMonkey is the fastest VM on 9 of the 26 benchmarks (3d-morph, bitops-3bit-bits-in-byte, bitops-bitwise-and, crypto-sha1, math-cordic, math-partial-sums, math-spectral-norm, string-base64, string-validate-input).











































































Figure 10. Speedup vs. a baseline JavaScript interpreter (SpiderMonkey) for our trace-based JIT compiler, Apple’s SquirrelFish Extreme inline threading interpreter and Google’s V8 JS compiler. Our system generates particularly efficient code for programs that benefit most from type specialization, which includes SunSpider benchmark programs that perform bit manipulation. We type-specialize the code in question to use integer arithmetic, which substantially improves performance. For one of the benchmark programs we execute 25 times faster than the SpiderMonkey interpreter, and almost 5 times faster than V8 and SFX. For a large number of benchmarks all three VMs produce similar results. We perform worst on benchmark programs that we do not trace and instead fall back onto the interpreter. This includes the recursive benchmarks access-binary-trees and controlflow-recursive, for which we currently don’t generate any native code.
In particular, the bitops benchmarks are short programs that perform many bitwise operations, so TraceMonkey can cover the entire program with 1 or 2 traces that operate on integers. TraceMonkey runs all the other programs in this set almost entirely as native code.
regexp-dna is dominated by regular expression matching, which is implemented in all 3 VMs by a special regular expression compiler. Thus, performance on this benchmark has little relation to the trace compilation approach discussed in this paper.
TraceMonkey’s smaller speedups on the other benchmarks can be attributed to a few specific causes:
• The implementation does not currently trace recursion, so TraceMonkey achieves a small speedup or no speedup on benchmarks that use recursion extensively: 3d-cube, 3d-raytrace, access-binary-trees, string-tagcloud, and controlflow-recursive.
• The implementation does not currently trace eval and some other functions implemented in C. Because date-format-tofte and date-format-xparb use such functions in their main loops, we do not trace them.
• The implementation does not currently trace through regular expression replace operations. The replace function can be passed a function object used to compute the replacement text. Our implementation currently does not trace functions called as replace functions. The runtime of string-unpack-code is dominated by such a replace call.
• Two programs trace well, but have a long compilation time. access-nbody forms a large number of traces (81). crypto-md5 forms one very long trace. We expect to improve performance on these programs by improving the compilation speed of nanojit.
• Some programs trace very well, and speed up compared to the interpreter, but are not as fast as SFX and/or V8, namely bitops-bits-in-byte, bitops-nsieve-bits, access-fannkuch, access-nsieve, and crypto-aes. The reason is not clear, but all of these programs have nested loops with small bodies, so we suspect that the implementation has a relatively high cost for calling nested traces. string-fasta traces well, but its runtime is dominated by string processing builtins, which are unaffected by tracing and seem to be less efficient in SpiderMonkey than in the two other VMs.
Detailed performance metrics. In Figure 11 we show the fraction of instructions interpreted and the fraction of instructions executed as native code. This figure shows that for many programs, we are able to execute almost all the code natively.
Figure 12 breaks down the total execution time into four activities: interpreting bytecodes while not recording, recording traces (including time taken to interpret the recorded trace), compiling traces to native code, and executing native code traces.
These detailed metrics allow us to estimate parameters for a simple model of tracing performance. These estimates should be considered very rough, as the values observed on the individual benchmarks have large standard deviations (on the order of the
Benchmark                 Loops  Trees  Traces  Aborts  Flushes  Trees/Loop  Traces/Tree  Traces/Loop  Speedup
3d-cube                      25     27      29       3        0         1.1          1.1          1.2    2.20x
3d-morph                      5      8       8       2        0         1.6          1.0          1.6    2.86x
3d-raytrace                  10     25     100      10        1         2.5          4.0         10.0    1.18x
access-binary-trees           0      0       0       5        0           -            -            -    0.93x
access-fannkuch              10     34      57      24        0         3.4          1.7          5.7    2.20x
access-nbody                  8     16      18       5        0         2.0          1.1          2.3    4.19x
access-nsieve                 3      6       8       3        0         2.0          1.3          2.7    3.05x
bitops-3bit-bits-in-byte      2      2       2       0        0         1.0          1.0          1.0   25.47x
bitops-bits-in-byte           3      3       4       1        0         1.0          1.3          1.3    8.67x
bitops-bitwise-and            1      1       1       0        0         1.0          1.0          1.0   25.20x
bitops-nsieve-bits            3      3       5       0        0         1.0          1.7          1.7    2.75x
controlflow-recursive         0      0       0       1        0           -            -            -    0.98x
crypto-aes                   50     72      78      19        0         1.4          1.1          1.6    1.64x
crypto-md5                    4      4       5       0        0         1.0          1.3          1.3    2.30x
crypto-sha1                   5      5      10       0        0         1.0          2.0          2.0    5.95x
date-format-tofte             3      3       4       7        0         1.0          1.3          1.3    1.07x
date-format-xparb             3      3      11       3        0         1.0          3.7          3.7    0.98x
math-cordic                   2      4       5       1        0         2.0          1.3          2.5    4.92x
math-partial-sums             2      4       4       1        0         2.0          1.0          2.0    5.90x
math-spectral-norm           15     20      20       0        0         1.3          1.0          1.3    7.12x
regexp-dna                    2      2       2       0        0         1.0          1.0          1.0    4.21x
string-base64                 3      5       7       0        0         1.7          1.4          2.3    2.53x
string-fasta                  5     11      15       6        0         2.2          1.4          3.0    1.49x
string-tagcloud               3      6       6       5        0         2.0          1.0          2.0    1.09x
string-unpack-code            4      4      37       0        0         1.0          9.3          9.3    1.20x
string-validate-input         6     10      13       1        0         1.7          1.3          2.2    1.86x
Figure 13. Detailed trace recording statistics for the SunSpider benchmark set.
mean). We exclude regexp-dna from the following calculations, because most of its time is spent in the regular expression matcher, which has much different performance characteristics from the other programs. (Note that this only makes a difference of about 10% in the results.) Dividing the total execution time in processor clock cycles by the number of bytecodes executed in the base interpreter shows that on average, a bytecode executes in about 35 cycles. Native traces take about 9 cycles per bytecode, a 3.9x speedup over the interpreter.
Using similar computations, we find that trace recording takes about 3800 cycles per bytecode, and compilation 3150 cycles per bytecode. Hence, during recording and compiling the VM runs at 1/200 the speed of the interpreter. Because it costs 6950 cycles to compile a bytecode, and we save 26 cycles each time that code is run natively, we break even after running a trace 270 times.
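The arithmetic behind these estimates can be reproduced directly from the per-bytecode cycle counts quoted above. The variable names are chosen for this illustration only; the figures in the text are the results rounded.

```python
record_cycles = 3800      # recording cost per bytecode
compile_cycles = 3150     # compilation cost per bytecode
interp_cycles = 35        # interpreter cost per bytecode
native_cycles = 9         # native trace cost per bytecode

one_time_cost = record_cycles + compile_cycles       # 6950 cycles
saving_per_run = interp_cycles - native_cycles       # 26 cycles
slowdown = one_time_cost / interp_cycles             # about 199, quoted as 200
break_even_runs = one_time_cost / saving_per_run     # about 267, quoted as 270
native_speedup = interp_cycles / native_cycles       # about 3.9
```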
The other VMs we compared with achieve an overall speedup of 3.0x relative to our baseline interpreter. Our estimated native code speedup of 3.9x is significantly better. This suggests that our compilation techniques can generate more efficient native code than any other current JavaScript VM.
These estimates also indicate that our startup performance could be substantially better if we improved the speed of trace recording and compilation. The estimated 200x slowdown for recording and compilation is very rough, and may be influenced by startup factors in the interpreter (e.g., caches that have not warmed up yet during recording). One observation supporting this conjecture is that in the tracer, interpreted bytecodes take about 180 cycles to run. Still, recording and compilation are clearly both expensive, and a better implementation, possibly including redesign of the LIR abstract syntax or encoding, would improve startup performance.
Our performance results confirm that type specialization using trace trees substantially improves performance. We are able to outperform the fastest available JavaScript compiler (V8) and the fastest available JavaScript inline threaded interpreter (SFX) on 9 of 26 benchmarks.
8. Related Work
Trace optimization for dynamic languages. The closest area of related work is on applying trace optimization to type-specialize dynamic languages. Existing work shares the idea of generating type-specialized code speculatively with guards along interpreter traces.
To our knowledge, Rigo’s Psyco (16) is the only published type-specializing trace compiler for a dynamic language (Python). Psyco does not attempt to identify hot loops or inline function calls. Instead, Psyco transforms loops to mutual recursion before running and traces all operations.
Pall’s LuaJIT is a Lua VM in development that uses trace compilation ideas (1). There are no publications on LuaJIT but the creator has told us that LuaJIT has a similar design to our system, but will use a less aggressive type speculation (e.g., using a floating-point representation for all number values) and does not generate nested traces for nested loops.
General trace optimization. General trace optimization has a longer history that has treated mostly native code and typed languages like Java. Thus, these systems have focused less on type specialization and more on other optimizations.
Dynamo (7) by Bala et al. introduced native code tracing as a replacement for profile-guided optimization (PGO). A major goal was to perform PGO online so that the profile was specific to the current execution. Dynamo used loop headers as candidate hot traces, but did not try to create loop traces specifically.
TracetreeswereoriginallyproposedbyGaletal.(11)inthe contextofJava,astaticallytypedlanguage.Theirtracetreesactuallyinlinedpartsofouterloopswithintheinnerloops(because
)*+6:;<6:,/#0(1$23# :,,/==+.>?:6;+<6//=#0!1923# :,,/==+@:??A-,8#0$1$23# :,,/==+?.5*;#0%1$23# :,,/==+?=>/B/#0)1!23# .><57=+).><+.><=+>?+.;</#0$C1C23# .><57=+.><=+>?+.;</#0'1D23# .><57=+.><E>=/+:?*#0$C1$23# .><57=+?=>/B/+.><=#0$1D23# ,5?<65FG5E+6/,-6=>B/#0(1!23# ,6;7<5+:/=#0(1&23# ,6;7<5+4*C#0$1)23# ,6;7<5+=8:(#0C1923# *:</+@564:<+<5H/#0(1(23# *:</+@564:<+27:6.#0(1!23# 4:<8+,56*>,#0%1923# 4:<8+7:6I:F+=-4=#0C1923# 4:<8+=7/,<6:F+?564#0D1(23# 6/J/27+*?:#0%1$23# =<6>?J+.:=/&%#0$1C23# =<6>?J+@:=<:#0(1C23# =<6>?J+<:J,F5-*#0(1(23# =<6>?J+-?7:,A+,5*/#0(1$23# =<6>?J+B:F>*:</+>?7-<#0(1923#
)*+45678#0$1923#
)*+,-./#0$1$23#


Figure12.FractionoftimespentonmajorVMactivities. The speedupvs.interpreterisshowninparenthesesnexttoeachtest. MostprogramswheretheVMspendsthemajorityofitstimerunningnativecodehaveagoodspeedup.Recordingandcompilation costscanbesubstantial;speedingupthosepartsoftheimplementationwouldimproveSunSpiderperformance.
eratenativecodewithnearlythesamestructurebutbetterperformance.
Callthreading,alsoknownascontextthreading(8),compiles methodsbygeneratinganativecallinstructiontoaninterpreter methodforeachinterpreterbytecode.Acall-returnpairhasbeen showntobeapotentiallymuchmoreefficientdispatchmechanism thantheindirectjumpsusedinstandardbytecodeinterpreters.
Inlinethreading(15)copieschunksofinterpreternativecode whichimplementtherequiredbytecodesintoanativecodecache, thusactingasasimpleper-methodJITcompilerthateliminatesthe dispatchoverhead.
Neithercallthreadingnorinlinethreadingperformtypespecialization.
Apple’sSquirrelFishExtreme(5)isaJavaScriptimplementationbasedoncallthreadingwithselectiveinlinethreading.Combinedwithefficientinterpreterengineering,thesethreadingtechniqueshavegivenSFXexcellentperformanceonthestandardSunSpiderbenchmarks.
Google’sV8isaJavaScriptimplementationprimarilybased oninlinethreading,withcallthreadingonlyforverycomplex operations.
9.Conclusions
innerloopsbecomehotfirst),leadingtomuchgreatertailduplication.
YETI,fromZaleskietal.(19)appliedDynamo-styletracing toJavainordertoachieveinlining,indirectjumpelimination, andotheroptimizations.Theirprimaryfocuswasondesigningan interpreterthatcouldeasilybegraduallyre-engineeredasatracing VM.
Suganumaetal.(18)describedregion-basedcompilation(RBC), arelativeoftracing.Aregionisansubprogramworthoptimizing thatcanincludesubsetsofanynumberofmethods.Thus,thecompilerhasmoreflexibilityandcanpotentiallygeneratebettercode, buttheprofilingandcompilationsystemsarecorrespondinglymore complex.
Typespecializationfordynamiclanguages. Dynamiclanguageimplementorshavelongrecognizedtheimportanceoftype specializationforperformance.Mostpreviousworkhasfocusedon methodsinsteadoftraces.
Chambers et al. (9) pioneered the idea of compiling multiple versions of a procedure specialized for the input types in the language Self. In one implementation, they generated a specialized method online each time a method was called with new input types. In another, they used an offline whole-program static analysis to infer input types and constant receiver types at call sites. Interestingly, the two techniques produced nearly the same performance.
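The online variant of this idea, generating a new specialized version whenever a procedure is called with input types not seen before, can be sketched as follows. This is an illustration only, not the Self implementation: the helper names `typeKey` and `customize`, and the use of `typeof` as the type abstraction, are assumptions made for the example.

```javascript
// Summarize the argument types of a call, e.g. "number,number".
function typeKey(args) {
  return args.map((a) => typeof a).join(",");
}

// Wrap a hypothetical "compiler" so that one version is generated per
// distinct combination of input types and reused on later calls.
function customize(makeVersion) {
  const versions = new Map(); // type combination -> specialized version
  const dispatch = (...args) => {
    const key = typeKey(args);
    let version = versions.get(key);
    if (!version) {
      version = makeVersion(key); // first call with these types: specialize
      versions.set(key, version);
    }
    return version(...args);
  };
  dispatch.versions = versions;
  return dispatch;
}

// Stand-in for a compiler: picks a body suited to the observed types.
const add = customize((key) => {
  if (key === "number,number") return (a, b) => a + b;        // numeric path
  if (key === "string,string") return (a, b) => a.concat(b);  // string path
  return (a, b) => a + b;                                     // generic fallback
});

add(1, 2);
add(3, 4);     // reuses the number,number version
add("a", "b"); // triggers a second version
console.log(add.versions.size); // number of versions generated so far
```

The unit of specialization here is a whole procedure keyed by its input types, which is the method-oriented approach the surrounding text contrasts with traces.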
Salib (17) designed a type inference algorithm for Python based on the Cartesian Product Algorithm and used the results to specialize on types and translate the program to C++.
McCloskey (14) has work in progress based on a language-independent type inference that is used to generate efficient C implementations of JavaScript and Python programs.
Native code generation by interpreters. The traditional interpreter design is a virtual machine that directly executes ASTs or machine-code-like bytecodes. Researchers have shown how to gen-
This paper described how to run dynamic languages efficiently by recording hot traces and generating type-specialized native code. Our technique focuses on aggressively inlined loops, and for each loop, it generates a tree of native code traces representing the paths and value types through the loop observed at run time. We explained how to identify loop nesting relationships and generate nested traces in order to avoid excessive code duplication due to the many paths through a loop nest. We described our type specialization algorithm. We also described our trace compiler, which translates a trace from an intermediate representation to optimized native code in two linear passes.
Our experimental results show that in practice loops typically are entered with only a few different combinations of value types of variables. Thus, a small number of traces per loop is sufficient to run a program efficiently. Our experiments also show that on programs amenable to tracing, we achieve speedups of 2x to 20x.
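The observation that few type combinations occur per loop can be illustrated with a toy model. The sketch below is not the compiler described in this paper: it stands in for type guards with a lookup keyed on `typeof`, and `recordTrace` and `runLoop` are hypothetical names. It runs the loop body `acc = acc + x` and records one specialized closure per type combination that actually occurs.

```javascript
// Hypothetical recorder: returns a body specialized for one type combination.
function recordTrace(accType, xType) {
  if (accType === "number" && xType === "number") {
    return (acc, x) => acc + x;               // numeric addition only
  }
  return (acc, x) => String(acc) + String(x); // string concatenation path
}

// Run the loop, recording a new trace only when no existing trace
// matches the current types (the analogue of a failed type guard).
function runLoop(values, initial) {
  const traces = new Map(); // type combination -> specialized trace
  let acc = initial;
  for (const x of values) {
    const key = `${typeof acc},${typeof x}`;
    let trace = traces.get(key);
    if (!trace) {
      trace = recordTrace(typeof acc, typeof x);
      traces.set(key, trace);
    }
    acc = trace(acc, x);
  }
  return { result: acc, traceCount: traces.size };
}

console.log(runLoop([1, 2, 3, 4, 5], 0));    // one type combination throughout
console.log(runLoop([1, 2, 3, "px", 4], 0)); // types change partway through
```

In the first call every iteration sees the same types, so a single trace suffices. In the second, the string element introduces two further combinations (number with string, then string with number), and the loop still needs only three traces in total.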
10. Future Work
Work is underway in a number of areas to further improve the performance of our trace-based JavaScript compiler. We currently do not trace across recursive function calls, but plan to add the support for this capability in the near term. We are also exploring adoption of the existing work on tree recompilation in the context of the presented dynamic compiler in order to minimize JIT pause times and obtain the best of both worlds, fast tree stitching as well as the improved code quality due to tree recompilation.
We also plan on adding support for tracing across regular expression substitutions using lambda functions, function applications and expression evaluation using eval. All these language constructs are currently executed via interpretation, which limits our performance for applications that use those features.
Acknowledgments
Parts of this effort have been sponsored by the National Science Foundation under grants CNS-0615443 and CNS-0627747, as well as by the California MICRO Program and industrial sponsor Sun Microsystems under Project No. 07-127.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Any opinions, findings, and conclusions or recommendations expressed here are those of the author and should not be interpreted as necessarily representing the official views, policies or endorsements, either expressed or implied, of the National Science Foundation (NSF), any other agency of the U.S. Government, or any of the companies mentioned above.
References
[1] LuaJIT roadmap 2008 - http://lua-users.org/lists/lua-l/200802/msg00051.html.
[2] Mozilla — Firefox web browser and Thunderbird email client - http://www.mozilla.com.
[3] SPEC JVM98 - http://www.spec.org/jvm98/.
[4] SpiderMonkey (JavaScript-C) Engine - http://www.mozilla.org/js/spidermonkey/.
[5] Surfin’ Safari - Blog Archive - Announcing SquirrelFish Extreme - http://webkit.org/blog/214/introducing-squirrelfish-extreme/.
[6] A. Aho, R. Sethi, J. Ullman, and M. Lam. Compilers: Principles, techniques, and tools, 2006.
[7] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 1–12. ACM Press, 2000.
[8] M. Berndl, B. Vitale, M. Zaleski, and A. Brown. Context Threading: a Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on, pages 15–26, 2005.
[9] C. Chambers and D. Ungar. Customization: Optimizing Compiler Technology for SELF, a Dynamically-Typed Object-Oriented Programming Language. In Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, pages 146–160. ACM New York, NY, USA, 1989.
[10] A. Gal. Efficient Bytecode Verification and Compilation in a Virtual Machine Dissertation. PhD thesis, University Of California, Irvine, 2006.
[11] A. Gal, C. W. Probst, and M. Franz. HotpathVM: An effective JIT compiler for resource-constrained devices. In Proceedings of the International Conference on Virtual Execution Environments, pages 144–153. ACM Press, 2006.
[12] C. Garrett, J. Dean, D. Grove, and C. Chambers. Measurement and Application of Dynamic Receiver Class Distributions. 1994.
[13] J. Ha, M. R. Haghighat, S. Cong, and K. S. McKinley. A concurrent trace-based just-in-time compiler for javascript. Dept. of Computer Sciences, The University of Texas at Austin, TR-09-06, 2009.
[14] B. McCloskey. Personal communication.
[15] I. Piumarta and F. Riccardi. Optimizing direct threaded code by selective inlining. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, pages 291–300. ACM New York, NY, USA, 1998.
[16] A. Rigo. Representation-Based Just-In-time Specialization and the Psyco Prototype for Python. In PEPM, 2004.
[17] M. Salib. Starkiller: A Static Type Inferencer and Compiler for Python. In Master’s Thesis, 2004.
[18] T. Suganuma, T. Yasue, and T. Nakatani. A Region-Based Compilation Technique for Dynamic Compilers. ACM Transactions on Programming Languages and Systems (TOPLAS), 28(1):134–174, 2006.
[19] M. Zaleski, A. D. Brown, and K. Stoodley. YETI: A graduallY Extensible Trace Interpreter. In Proceedings of the International Conference on Virtual Execution Environments, pages 83–93. ACM Press, 2007.