

Transactional Memory Coherence and Consistency

Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun

Stanford University
{lance, vicwong, broccoli, bdc, johnd, elektrik, mkprabhu, wijayah}@stanford.edu, christos@ee.stanford.edu, kunle@stanford.edu

Abstract

In this paper, we propose a new shared memory model: Transactional memory Coherence and Consistency (TCC). TCC provides a model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference consistency. TCC greatly simplifies parallel software by eliminating the need for synchronization using conventional locks and semaphores, along with their complexities. TCC hardware must combine all writes from each transaction region in a program into a single packet and broadcast this packet to the permanent shared memory state atomically, as a large block. This simplifies the coherence hardware because it reduces the need for small, low-latency messages and completely eliminates the need for conventional snoopy cache coherence protocols, as multiple speculatively written versions of a cache line may safely coexist within the system. Meanwhile, automatic, hardware-controlled rollback of speculative transactions resolves any correctness violations that may occur when several processors attempt to read and write the same data simultaneously. The cost of this simplified scheme is higher interprocessor bandwidth.

To explore the costs and benefits of TCC, we study the characteristics of an optimal transaction-based memory system, and examine how different design parameters could affect the performance of real systems. Across a spectrum of applications, the TCC model itself did not limit available parallelism. Most applications are easily divided into transactions requiring only small write buffers, on the order of a few kilobytes. The broadcast requirements of TCC are high, but are well within the capabilities of CMPs and small-scale SMPs with high-speed interconnects.

INTRODUCTION

Parallel processors have become increasingly common and more densely packed in recent years. In the near future, these systems will provide many conventional processors packed together into chip multiprocessors (CMPs) or single-board systems, interconnected with some form of high-bandwidth communication bus or network. With these systems, enough bandwidth can be provided between processors to even allow the broadcast of significant amounts of data and/or protocol overhead between all of the processor nodes over a low-latency, unordered interconnect [ ]. Overwhelmingly, designers of today's parallel processing systems have chosen to use one of two common models to coordinate communication and synchronization in their systems: message passing or shared memory.

Given the advent of newer systems with immense interprocessor bandwidth, however, we wondered if it would be possible to take advantage of this bandwidth to simplify the protocols used to manage communication and synchronization between processors in a system.

Message passing is a system that supports relatively simple hardware configurations, such as clusters of workstations, but makes programmers work hard to take advantage of the hardware. The programming model is one of many independent nodes that must pass explicit messages between each other when communication is necessary. Messages also implicitly synchronize processors as they are sent and received. This technique typically makes the underlying hardware much simpler by making programmers concentrate their communication into a relatively small number of large data packets that can flow throughout the system with relatively relaxed latency requirements. To facilitate this, programmers must divide data structures and execution into independent units that can execute efficiently on individual processor nodes.

In contrast, shared memory adds additional hardware to provide programmers with an illusion of a single shared memory common to all processors, avoiding or minimizing the problem of manual data distribution. This is accomplished by tracking shared cache lines as they move throughout the system, either through the use of a snoopy bus coherence protocol over a shared bus [ ] or through a directory-based coherence mechanism over an unordered interconnect [ ]. Programmers must still divide their computation into parallel tasks, but all tasks can work with a single, common dataset resident in memory. While this model significantly reduces the difficulty inherent in parallel programming, especially for programs that exhibit dynamic communication or fine-grain sharing, the hardware required to support it can be very complex [ ]. In order to provide a coherent view of memory, the hardware must track where the latest version of any particular memory address can be found, recover the latest version of a cache line from anywhere on the system when a load from it occurs, and efficiently support the communication of large numbers of small, cache-line-sized packets of data between processors. All this must be done with minimal latency, too, since individual load and store instructions are dependent upon each communication event. Achieving high performance despite the presence of long interprocessor latencies is therefore a problem with these systems. Further complicating matters is the problem of sequencing the various communication events constantly passing throughout the system on the granularity of individual load and store instructions. Unfortunately, shared memory does not provide the implicit synchronization of message passing, so hardware rules (memory consistency models [ ]) have been devised, and software synchronization routines have been carefully crafted around these rules to provide the necessary synchronization.



Over the years, the memory consistency model has progressed from the easy-to-understand but sometimes performance-limiting sequential consistency [ ] to more modern schemes such as relaxed consistency [ ]. The complex interaction of coherence, synchronization, and consistency can potentially make the job of parallel programming on shared memory architectures difficult.

Both of these models therefore have drawbacks: message passing makes software design difficult, while shared memory requires complex hardware to get only a slightly simpler programming model. Ideally, we would like a communication model that, without raising memory consistency issues, presents a shared memory model to programmers and significantly reduces the need for hardware to support frequent, latency-sensitive coherence requests for individual cache lines. At the same time, we would like to be able to take advantage of the inherent synchronization and latency-tolerance of message passing protocols. Replacing conventional cache-line-oriented coherence protocols and conventional shared memory consistency models with a Transactional memory Coherence and Consistency (TCC) model can accomplish this.

The TCC system is described in detail in Section 2 of this paper, and compared further with state-of-the-art coherence and consistency solutions in Section 3. Section 4 examines several design alternatives that engineers implementing a TCC system may need to consider. Sections 5 and 6 describe our initial analytical framework for investigating TCC and some results that we obtained with a variety of benchmarks using it. Finally, Section 7 looks ahead to some future work, and Section 8 concludes.

SYSTEM OVERVIEW

Processors such as those in Fig. 1, operating under a TCC model, continually execute speculative transactions. A transaction is a sequence of instructions that is guaranteed to execute and complete only as an atomic unit. Each transaction produces a block of writes called the write state, which are committed to shared memory only as an atomic unit after the transaction completes execution. Once the transaction is complete, hardware must arbitrate system-wide for the permission to commit its writes. After this permission is granted, the processor can take advantage of high system interconnect bandwidths to simply broadcast all writes for the entire transaction out as one large packet to the rest of the system. Note that the broadcast can be over an unordered interconnect, with individual stores separated and reordered, as long as stores from different commits are not reordered or overlapped. Snooping by other processors on these store packets maintains coherence in the system and allows them to detect when they have used data that has subsequently been modified by another transaction and must roll back: a dependence violation. Combining all writes from the entire transaction together minimizes the latency sensitivity of this scheme, because fewer interprocessor messages and arbitrations are required, and because flushing out the write state is a one-way operation. At the same time, since we only need to control the sequencing between entire transactions, instead of individual loads and stores, we leverage the commit operation to provide inherent synchronization and a greatly simplified consistency protocol.

[Figure 1 diagram residue removed. Recoverable labels: Processor Core (loads and stores); Local Cache Hierarchy with per-line V, Read, and M(odified) bits, Tag, and Data; Write Buffer and Rename bits (stores only); Snooping from other nodes; Commits to other nodes; Commit Control with Phase and Sequence; Node 0 through Node 3 attached to a Broadcast Bus or Network.]

Figure 1: A sample 4-node TCC system.

This continual speculative buffering, broadcast, and (potential) violation cycle, illustrated in Fig. 3a, allows us to replace conventional coherence and consistence protocols simultaneously:

• Consistence: Instead of attempting to impose some sort of ordering rules between individual memory reference instructions, as with most consistency models, TCC just imposes a sequential ordering between transaction commits. This can drastically reduce the number of latency-sensitive arbitration and synchronization events required by low-level protocols in a typical multiprocessor system. As far as the global memory state and software is concerned, all memory references from a processor that commits earlier happened "before" all memory references from a processor that commits afterwards, even if the references actually executed in an interleaved fashion. A processor that reads data that is subsequently updated by another processor's commit, before it can commit itself, is forced to violate and roll back in order to enforce this model. Interleaving between processors' memory references is only allowed at transaction boundaries, greatly simplifying the process of writing programs that make fine-grained access to shared variables. In fact, by imposing an original sequential program's transaction order on the transaction commits, we can effectively let the TCC system provide an illusion of uniprocessor execution to the sequence of memory references generated by parallel software.

• Coherence: Stores are buffered and kept within the processor node for the duration of the transaction, in order to maintain the atomicity of the transaction. No conventional MESI-style cache protocols are used to maintain lines in "shared" or "exclusive" states at any point in the system, so it is legal for many processor nodes to hold the same line simultaneously, in either an unmodified or speculatively modified form.



At the end of each transaction, the broadcast notifies all other processors about what state has changed during the completing transaction. During this process, they perform conventional invalidation (if the commit packet only contains addresses) or update (if it contains addresses and data) to keep their cache state coherent. Simultaneously, they must determine if they may have used shared data too early. If they have read any data modified by the committing transaction during their currently executing transaction, they are forced to restart and reload the correct data. This hardware mechanism protects against true data dependencies automatically, without requiring programmers to insert locks or related constructs. At the same time, data antidependencies are handled simply by the fact that later processors will eventually get their own turn to flush out data to memory. Until that point, their "later" results are not seen by transactions that commit earlier (avoiding WAR dependencies), and they are able to freely overwrite previously modified data in a clearly sequenced manner (handling WAW dependencies in a legal way). Effectively, the simple sequentialized consistence model allows the coherence model to be greatly simplified as well.

Although some of the details and implementation alternatives add more complexity, this simple cycle is the backbone of the TCC system and underlies all other descriptions of the system throughout the rest of this paper.
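To make this cycle concrete, the following C-style sketch models the per-processor loop implied by this section: execute a transaction speculatively, arbitrate for commit, broadcast the write state as one packet, and roll back to the last checkpoint if a snooped commit reveals a dependence violation. It is only an illustration; every function name, and the simplification of reporting a violation as a return value, are assumptions rather than part of the TCC specification.

#include <stdbool.h>

typedef struct write_state write_state_t;      /* buffered speculative stores        */

extern void checkpoint_registers(void);
extern void restore_checkpoint(void);
extern bool run_transaction_body(write_state_t *ws);   /* returns false on violation */
extern void arbitrate_for_commit(int phase);
extern void broadcast_write_state(const write_state_t *ws); /* one atomic packet     */
extern void discard_speculative_state(write_state_t *ws);

void tcc_processor_loop(int my_phase)
{
    write_state_t ws;
    for (;;) {
        checkpoint_registers();               /* rollback point for this transaction */
        if (!run_transaction_body(&ws)) {     /* another node committed data we read */
            discard_speculative_state(&ws);
            restore_checkpoint();             /* violate: re-execute the transaction */
            continue;
        }
        arbitrate_for_commit(my_phase);       /* system-wide, ordered by phase       */
        broadcast_write_state(&ws);           /* entire write state as one packet    */
    }
}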

Programming Model

For programmers, there is really only one requirement for successful transactional execution: the programmer must insert transaction boundaries into their parallel code occasionally (possibly with some hardware aid; see Section 4.2). No complex sequences of special instructions, such as locks, semaphores, or monitors, are ever necessary to control low-level interprocessor communication and synchronization. In many respects, this model is very similar to the technique of performing manual parallelization with assistance from thread-level speculation (TLS; see Section 3.2) hardware that we previously investigated in [ ]. There is only one hard rule that programmers must keep in mind: transaction breaks should never be inserted during the code between a load and any subsequent store of a shared value (i.e., during a conventional lock's critical region). Unlike with conventional parallelization, other "errors" will only cause reduced performance instead of incorrect execution.

As a result of this model, parallelizing code with TCC is a very different process from conventional parallel programming, because it allows programmers to make intelligent tradeoffs between programmer effort and performance. Basic parallelization can quickly and easily be done by identifying potentially interesting transactions, and then programmers can use feedback from runtime violation reports to refine their transaction selection in order to get significantly greater speedups. In a simplified form, parallel programming with TCC can be summarized as a three-step process:

• Divide into Transactions: The first step in the creation of a parallel program using TCC is to coarsely divide the program into blocks of code that can run concurrently on different processors.

 

  

   



  



 





 















  

Figure 2: Timing illustration of how transactions (numbered blocks) running on three different processors are forced to commit by phase number sequence.

This is similar to conventional parallelization, which also requires that programmers find and mark parallel regions. However, the actual process is much simpler with TCC, because the programmer does not need to guarantee that parallel regions are independent, since the TCC hardware will catch all dependence violations at runtime.

• Specify Order: The programmer can optionally specify an ordering between transactions to maintain a program order that must be enforced. By default, no order is imposed between the commits of the various transactions, so different processors may proceed independently and commit as they encounter end-of-transaction instructions. However, most parallel applications have places where certain transactions must complete before others. This situation can be addressed by assigning hardware-managed phase numbers to each transaction. At any point in time, only transactions from the "oldest" phase present in the system are allowed to commit. Transactions from "newer" phases are simply forced to stall and wait if they complete before all "older" phases have completed.

Fig. 2 illustrates how most important transaction sequencing events can be handled using phase numbers. The top half of the figure shows groups of unordered transactions, for which we simply keep the phase numbers identical. To form a barrier, all processors increment the phase number of transactions as they cross the barrier point, so that all pre-barrier transactions are forced to commit before any post-barrier transactions can complete. To parallelize sequential code in a TLS-like fashion, we simply increment the phase number for each transaction created from the original sequential code, as is illustrated in the bottom half of Fig. 2. This forces the commits to occur in order, which in turn guarantees that the parallel execution will mimic the load/store behavior of the original program. In addition to being easy to implement in hardware, this scheme is also guaranteed to be deadlock-free, since at least one processor is always running the "oldest" phase and therefore able to commit when it completes.



Also, in order to allow several phase progression sequences to occur simultaneously in different parts of the system, we could add an optional sequence number to the hardware in order to separate out these different groups of phasings. (A brief illustrative sketch of phase-number assignment appears at the end of this subsection.)


• Performance Tuning: After transactions are selected and ordered, the program can be run in parallel. The TCC system can automatically provide informative feedback about where violations occur in the program, which can direct the programmer to perform further optimizations. These optimizations usually improve code by making it follow these guidelines:

• Modified bit: There must be one for every cache line. These are set by stores to indicate when any part of the line has been written speculatively. These are used to invalidate all speculatively written lines at once when a violation is detected.



1. Transactions should be chosen to maximize parallelism and minimize the number of inter-transaction data dependencies. A few occasional violations are acceptable, but regularly occurring ones will largely eliminate the possibility for speedup in most systems.

2. Large transactions are preferable when possible, as they amortize the startup and commit overhead time better than small ones, but

3. Small transactions should be used when violations are frequent, to minimize the amount of lost work, or when large numbers of memory references tend to overflow the available memory buffering.

This streamlined parallel coding model therefore allows parallel programmers to focus on providing better performance, instead of spending most of their time simply worrying about correctness.
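As a rough illustration of the phase-number rules described above (identical phases for unordered transactions, an incremented phase at barriers, and a per-transaction increment for sequential, TLS-like ordering), the sketch below uses a hypothetical tcc_end_transaction(phase) primitive; the API is assumed for illustration only.

extern int  tcc_current_phase(void);
extern void tcc_end_transaction(int phase);   /* commit waits until this phase is oldest */

/* Unordered parallel work: every transaction keeps the same phase number. */
void unordered_region(void)
{
    int phase = tcc_current_phase();
    /* ... transaction body ... */
    tcc_end_transaction(phase);
}

/* Barrier: processors bump the phase as they cross it, so every pre-barrier
 * transaction commits before any post-barrier transaction can complete. */
void barrier_region(void)
{
    int phase = tcc_current_phase();
    /* ... pre-barrier transaction ... */
    tcc_end_transaction(phase + 1);
}

/* TLS-style ordered loop: the phase grows with each loop-body transaction,
 * forcing commits in the original sequential order. */
void ordered_loop(int n)
{
    int phase = tcc_current_phase();
    for (int i = 0; i < n; i++) {
        /* ... body of iteration i, executed as one transaction ... */
        tcc_end_transaction(phase + i + 1);
    }
}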

Basic TCC System

TCC will work in a wide variety of multiprocessor hardware environments, including a variety of CMP configurations and small-scale multichip multiprocessors. TCC cannot scale infinitely, for two reasons. First, TCC requires system-wide arbitration for the commit permission, either through a centralized arbiter or a distributed algorithm. Second, TCC relies on broadcast to send the commit packets throughout the system. The algorithm is currently dependent upon some form of broadcast, although we examine some mechanisms to reduce bandwidth requirements in Section 4. Our scheme can work within any system that can support these two requirements in an efficient manner.

Individual processor nodes within a TCC system must have some features to provide speculative buffering of memory references and commit control, as was illustrated in Fig. 1. Each "node" consists of a processor core plus its own local cache hierarchy. The exact structure of the local cache hierarchy makes no difference to the coherence scheme, as long as all of the included lines maintain the following information in some way:

• Read bits: These bits are set on loads to indicate that a cache line (or portion of a line) has been read speculatively during a transaction. These bits are snooped while other processor nodes commit, to determine when data has been speculatively read too early. If a write committed by another processor modifies an address cached locally with its read bit set, then a violation has been detected and the processor is interrupted, so that it can revert back to its last checkpoint and start re-executing from there. In a simple implementation, one read bit per line is sufficient. However, it may be desirable to include multiple bits per line, in order to eliminate false violation detections caused by reads and writes to different words in the same line.

In addition, we can optionally include an extra set of bits in each cache line to help avoid false violations that could be caused by reads and writes to the same part of a cache line:

• Renamed bits: These optional bits must be associated with individual words (or even bytes) within each cache line. They act much like "modified" bits, except that they can only be set if the entire word (or byte) is written by a store, instead of just any part of the associated region. Because individual stores can typically only write a small part of a cache line at a time, there must almost always be a large number of these bits for each line. If set, any subsequent reads from these words (bytes) do not need to set read bits, because they are guaranteed to only be reading locally generated data that cannot cause violations. Since these bits are optional, they can be omitted entirely or only partially implemented (for example, in a node's L1 cache but not in its L2).

Cache lines with set read or modified bits may not be flushed from the local cache hierarchy in mid-transaction. If cache conflicts or capacity constraints force this to occur, the discarded cache lines must be maintained in a victim buffer (which may just hold the tag and read bit(s) if a line is unmodified), or the processor must be stalled temporarily. In the latter case, it must request commit permission, a process that may take some time if processors with "older" phases are present, and then hold this permission until the transaction completes execution and commits. This solution works because read and modified bits do not need to be maintained once commit permission has been obtained, as all "earlier" commits will have been guaranteed to complete at that point. However, since holding the commit permission for extended periods of time can have a severely detrimental impact on the overall system performance, it is critical that this mechanism only be used for infrequent, very long transactions.

The processor core must also have a way to checkpoint its register state at each commit point, in order to provide rollback capabilities. This could be done either in hardware, by flash-copying the register state to a shadow register file at the beginning of each transaction, or in software, by executing a small handler to flush out the live register state at the start of each transaction. The hardware scheme could be incorporated into traditional register renaming hardware by flash-copying the register rename tables instead of the registers themselves. The software scheme would not require any modifications of the core at all, but such a scheme would obviously incur a higher overhead on the processor core at each commit.

Finally, the node must have a mechanism for collecting all of its modified cache lines together into a commit packet.



This can be implemented as a write buffer completely separate from the caches, or as an address buffer that maintains a list of the line tags that contain data that needs to be committed. We examine the size of write state in Section 6 to determine the amount of hardware that would be required to implement a typical write buffer beside the caches. The interface between this buffer and the system network should have a small "commit control" table that tracks the state (phase numbers) of other processors in the system, in order to determine when it is within the "oldest" phase and free to arbitrate for commits. This simple mechanism can eliminate a great deal of spurious arbitration request traffic.
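The per-line bits and the commit-control check described in this subsection might be summarized in software form roughly as follows; the field widths, the assumed 64-byte line size, and all names are illustrative assumptions rather than a prescribed implementation.

#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_LINE 16                    /* assumed 64-byte line, 4-byte words   */
#define MAX_NODES      8

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     read;        /* set by speculative loads (per line here; could be per word) */
    bool     modified;    /* set by any speculative store to the line                    */
    uint16_t renamed;     /* optional per-word bits: word fully written this transaction */
    uint8_t  data[WORDS_PER_LINE * 4];
} spec_line_t;

/* A load only needs to set the read bit if the word was not already renamed,
 * i.e., if it might hold data produced outside this transaction. */
void on_speculative_load(spec_line_t *line, int word)
{
    if (!(line->renamed & (1u << word)))
        line->read = true;
}

/* Snooping a line address from another node's commit packet: a committed write
 * to a line we read speculatively is a dependence violation; otherwise the line
 * is simply invalidated (or updated, if the packet carries data). */
bool snoop_commit(spec_line_t *line, uint32_t committed_tag)
{
    if (!line->valid || line->tag != committed_tag)
        return false;
    if (line->read)
        return true;                         /* violate: roll back to the checkpoint */
    line->valid = false;                     /* invalidate-style coherence action    */
    return false;
}

/* Commit control: a node only arbitrates for commit while it is in the "oldest"
 * phase known in the system, suppressing spurious arbitration traffic. */
bool may_arbitrate(const int phase_of[MAX_NODES], int nodes, int my_phase)
{
    for (int i = 0; i < nodes; i++)
        if (phase_of[i] < my_phase)
            return false;
    return true;
}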

5(/$7(':25. 7KLVSDSHUGUDZVXSRQLGHDVIURPWZRH[LVWLQJERGLHVRIZRUN GDWDEDVHWUDQVDFWLRQSURFHVVLQJV\VWHPVDQGWKUHDGOHYHOVSHFXOD WLRQ 7/6 DQGDSSOLHVWKHPWRWKHÀHOGRIFDFKHFRKHUHQWVKDUHG PHPRU\SDUDOOHODUFKLWHFWXUHV7KLVVHFWLRQFRPSDUHV7&&ZLWK VRPHNH\LGHDVIURPWKHVHWZRÀHOGVRINQRZOHGJH

'DWDEDVH7UDQVDFWLRQ3URFHVVLQJ 7UDQVDFWLRQVDUHDFRUHFRQFHSWLQGDWDEDVHPDQDJHPHQWV\VWHPV '%06 WKDWSURYLGHVLJQLÀFDQWEHQHÀWVWRWKHGDWDEDVHSURJUDP PHU>@,Q'%06WUDQVDFWLRQVSURYLGHWKHSURSHUWLHVRIDWRP LFLW\FRQVLVWHQF\LVRODWLRQDQGGXUDELOLW\ $&,' :HKDYHERU URZHGWKHIXOO\WUDQVDFWLRQDOSURJUDPPLQJPRGHOIURPGDWDEDVHV EHFDXVH ZH WKLQN WKDW WKHVH SURSHUWLHV ZLOO JUHDWO\ VLPSOLI\ WKH GHYHORSPHQW RI JHQHULF SDUDOOHO SURJUDPV7KH PDLQ GLIIHUHQFH EHWZHHQ WKH WUDQVDFWLRQV GHÀQHG E\ WKH GDWDEDVH SURJUDPPHU DQGWKRVHXVHGE\SDUDOOHOSURJUDPPHUVLVVL]H7KHQXPEHURI LQVWUXFWLRQVH[HFXWHGDQGWKHDPRXQWRIVWDWHJHQHUDWHGE\PRVW SDUDOOHOSURJUDPWUDQVDFWLRQVLVPXFKVPDOOHUWKDQWKRVHXVHGLQ GDWDEDVH WUDQVDFWLRQV 7KHUHIRUH D NH\ HOHPHQW WR XVLQJ WUDQV DFWLRQVIRUJHQHUDOSXUSRVHSDUDOOHOSURJUDPPLQJLVDQHIÀFLHQW KDUGZDUHEDVHGWUDQVDFWLRQH[HFXWLRQHQYLURQPHQW 7KH GHVLJQHUV RI '%06 KDYH H[SORUHG D ZLGH UDQJH RI LPSOH PHQWDWLRQ RSWLRQV IRU H[HFXWLQJ WUDQVDFWLRQV ZKLOH SURYLGLQJ KLJKWUDQVDFWLRQWKURXJKSXW7KHZRUNRQRSWLPLVWLFFRQFXUUHQF\ >@ LV WKH PRVW UHOHYDQW WR WKH LGHDV ZH H[SORUH LQ WKLV SDSHU 2SWLPLVWLFFRQFXUUHQF\FRQWUROVDFFHVVWRVKDUHGGDWDZLWKRXWXV LQJ ORFNV E\ GHWHFWLQJ FRQÁLFWV DQG EDFNLQJ XS WUDQVDFWLRQV WR HQVXUHFRUUHFWRSHUDWLRQ,QWUDQVDFWLRQDOFRKHUHQFHDQGFRQVLV WHQF\ZHH[WHQGWKHLGHDVRIRSWLPLVWLFFRQFXUUHQF\IURP'%06 WRPHPRU\V\VWHPKDUGZDUH

3UHYLRXV:RUNLQ7UDQVDFWLRQVDQG7/6 )URPWKHKDUGZDUHVLGHWKHRULJLQRIWKLVZRUNZDVLQWKHHDUO\ WUDQVDFWLRQDOPHPRU\ZRUNGRQHE\+HUOLK\>@DGHFDGHDJR 2XUWUDQVDFWLRQVKDYHLGHQWLFDOVHPDQWLFVWRWKHPRGHOSURSRVHG LQ WKLV SDSHU  +RZHYHU WKH\ SURSRVHG RQO\ XVLQJ WUDQVDFWLRQV RFFDVLRQDOO\ UHSODFLQJ RQO\ WKH FULWLFDO UHJLRQV RI ORFNV  $V D UHVXOW LW ZDV PRUH RI DQ DGMXQFW WR H[LVWLQJ VKDUHG PHPRU\ FRQVLVWHQF\SURWRFROVWKDQDFRPSOHWHUHSODFHPHQW%\UXQQLQJ WUDQVDFWLRQVDWDOOWLPHVLQVWHDGRIMXVWRFFDVLRQDOO\ZHDUHDEOH WRXVHWKHVDPHFRQFHSWVWRFRPSOHWHO\UHSODFHFRQYHQWLRQDOFR KHUHQFHDQGFRQVLVWHQF\WHFKQLTXHV+RZHYHUWKHODUJHUQXPEHU RI WUDQVDFWLRQ FRPPLWV LQ RXU PRGHO SXWV D JUHDW GHDO RI SUHV VXUHRQLQWHUSURFHVVRUFRPPXQLFDWLRQEDQGZLGWKVRSUDFWLFDOO\

VSHDNLQJLWZRXOGKDYHEHHQGLIÀFXOWWRLPSOHPHQWDPRGHOOLNH RXUVDGHFDGHDJR 2XUZRUNDOVRGUDZVXSRQWKHZLGHYDULHW\RIWKUHDGOHYHOVSHFX ODWLRQ 7/6 OLWHUDWXUHWKDWKDVEHHQSXEOLVKHGRYHUWKHFRXUVHRI WKH SDVW VHYHUDO \HDUV IURP WKH 0XOWLVFDODU SURMHFW >@ 6WDP SHGH >@  7RUUHOODV DW WKH 8QLYHUVLW\ RI ,OOLQRLV >@ DQG WKH +\GUD SURMHFW >@  ,Q IDFW D7&& V\VWHP FDQ DFWXDOO\ LPSOH PHQW D YHU\ ORRVHO\ FRXSOHG7/6 V\VWHP LI DOO WUDQVDFWLRQV DUH RUGHUHG VHTXHQWLDOO\  ,Q WKLV UHVSHFW LW PRVW FORVHO\ UHVHPEOHV WKH6WDPSHGHGHVLJQRID7/6V\VWHPDVWKHLU7/6WKUHDGVRQO\ ÁXVKRXWGDWDIURPWKHFDFKHWRJOREDOPHPRU\DWWKHHQGRIHDFK WKUHDGPXFKOLNHRXUFRPPLWV+RZHYHU6WDPSHGHOD\HUVWKH 7/6VXSSRUWRQWRSRIDFRQYHQWLRQDOFDFKHFRKHUHQFHSURWRFRO 7KHRWKHU7/6V\VWHPVSURYLGHPXFKWLJKWHUFRXSOLQJEHWZHHQ SURFHVVRUVDQGPRUHDXWRPDWLFIRUZDUGLQJRIGDWDEHWZHHQH[ HFXWLQJWKUHDGVDQGDUHWKHUHE\IXUWKHUUHPRYHGIURP7&&$V ORQJ DV IRUZDUGLQJ RI GDWD EHWZHHQ VSHFXODWLYH WKUHDGV LV QRW FULWLFDO IRU DQ DSSOLFDWLRQ KRZHYHU7&&·V SHUIRUPDQFH LQ ´DOO RUGHUHGWUDQVDFWLRQµPRGHFDQDFWXDOO\EHFRPSHWLWLYHZLWKWKHVH GHGLFDWHG7/6GHVLJQV /RRNLQJDWWKHSURSRVHGKDUGZDUHLPSOHPHQWDWLRQVRXUH[DPSOH LPSOHPHQWDWLRQLVPRVWVLPLODUWRWKH6WDPSHGHRU+\GUDGHVLJQV LQ LWV IRFXV RQ D PXOWLSURFHVVRU ZLWK D IHZ ÁDVKFOHDUDEOH ELWV DWWDFKHGWRWKHSULYDWHFDFKHV+RZHYHUZHFKRVHWKLVH[DPSOH GHVLJQVROHO\EHFDXVHLWLVDQHDV\ÀUVWVWHS$EXIIHULQJVFKHPH VXFKDVWKH$5%>@RU69&>@DVSURSRVHGE\WKH0XOWLVFD ODUJURXSZRXOGDOVREHDEOHWRKDQGOHWKHVSHFXODWLYHEXIIHULQJ WDVNVUHTXLUHGE\7&& &RPSDULVRQVFDQDOVREHPDGHEHWZHHQ7&&DQGRWKHUSURSRVDOV WRDGDSWVSHFXODWLYHPHFKDQLVPVWRLPSURYHWKHSHUIRUPDQFHRI FRQYHQWLRQDOSDUDOOHOSURJUDPPLQJPRGHOV)RUH[DPSOH0DUWL QH]DQG7RUUHOODV>@5DMZDUDQG*RRGPDQ>@DQG5XQG EHUJ DQG 6WHQVWURP >@ KDYH LQGHSHQGHQWO\ SURSRVHG KRZ WR VSHFXODWHWKURXJKORFNVDQGSDVWEDUULHUVLQUHFHQWSDSHUV7&& SHUIRUPVERWKRIWKHVHRSHUDWLRQVGXULQJQRUPDORSHUDWLRQ$OO H[HFXWLRQQRZFRQVLVWVRIWUDQVDFWLRQVWKDWFDQVSHFXODWHWKURXJK RQHRUVHYHUDOGLIIHUHQWFRQYHQWLRQDOORFNVDWRQFHZKLOHVSHFX ODWLRQSDVWSKDVHEDUULHUVFDQRFFXULIWKHLPSOHPHQWDWLRQRI7&& LVGRXEOHEXIIHUHG VHH6HFWLRQ 

TCC IMPROVEMENTS

The preceding sections presented the basic protocols used to construct a TCC system and how they compare with existing coherence and consistency protocols. There are extensions and improvements to these basic protocols that could improve performance or reduce the bandwidth requirements of TCC in a real system environment. This section describes some of these potential improvements.

Double Buffering

Double buffering implements extra write buffers and additional sets of read and modified bits in every cache line, so that successive transactions can alternate between sets of bits and buffers. This mechanism allows a processor to continue working on the next transaction even while the previous one is waiting to commit, or committing, as is illustrated in Fig. 3 for several combinations.



In addition, this extension automatically lets processors that arrive at barriers early continue to speculate past them, as in [ ], without any additional hardware or API considerations.

The major expense of this scheme is in replicating additional sets of speculative cache control bits and write buffers. Adding buffers probably will not scale well past one additional set, but that first set (providing double buffering) should provide most of the potential benefit. If hardware register checkpoints are used, then additional sets of shadow registers would also be required to allow checkpoints to be taken at the beginning of each speculative transaction supported by hardware. Only one copy of the optional renamed bits is ever necessary, since these do not serve any further function after a transaction finishes executing instructions. All of these special bits should be flash-clearable at the end of transactions, to avoid tying up the hardware for many cycles on each commit, and it is also helpful if the modified bits can flash-invalidate their lines when the transaction aborts on a violation. More sophisticated versioning protocols, similar to those used in the SVC implementation of buffering for TLS [ ], could be used to eliminate many of these circuit design issues, but we believe that flash-clearable bits are feasible in tag SRAMs.
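A minimal sketch of the double-buffering idea, assuming two interchangeable sets of write buffers and speculative bits: the processor starts the next transaction in one set while the previous set's commit drains in the background. The spec_set_t type and helper functions are invented purely for illustration.

typedef struct { unsigned char state[64]; } spec_set_t;   /* stand-in for one buffer + bit set */

extern void start_transaction(spec_set_t *s);     /* execute the next transaction using set s  */
extern void queue_commit(spec_set_t *s);          /* arbitrate and broadcast in the background */
extern int  commit_in_progress(const spec_set_t *s);
extern void flash_clear(spec_set_t *s);           /* flash-clear speculative bits after commit */

void double_buffered_loop(void)
{
    spec_set_t set[2];
    int cur = 0;
    for (;;) {
        while (commit_in_progress(&set[cur]))
            ;                                     /* stall only if this set's old commit lags  */
        flash_clear(&set[cur]);
        start_transaction(&set[cur]);
        queue_commit(&set[cur]);                  /* previous transaction may still be draining */
        cur ^= 1;                                 /* alternate sets for the next transaction   */
    }
}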


Hardware-Controlled Transactions

In the base TCC system, programmers explicitly mark all transaction boundaries. However, it is also possible for hardware to play a role in marking transaction boundaries, or sequencing the transaction commits once they have been initiated. There are three situations where hardware assistance would be helpful.

Hardware could divide program execution into transactions automatically as the speculative buffers overflow. This could achieve the optimal transaction size by not letting transactions get so large that buffering becomes a problem, while keeping them large enough to minimize the impact of any commit overhead. The most common situation when this might be helpful is for code that divides "naturally" into very large transactions, but when these transactions can be freely subdivided into smaller transactions. This is a common situation in programs that have already been parallelized in a conventional manner. While it is fairly easy for a software programmer to insert extra commit points into the middle of these transactions in order to keep the speculative buffering requirements manageable, it would be simpler for hardware to automatically insert transaction commits whenever the speculative buffers are filled, thereby automatically breaking up the large transaction into transactions that are sized perfectly for the available speculative buffer sizes. The only limitation on this technique is that it no longer guarantees atomic execution semantics within the large transaction, as the hardware is free to insert an extra transaction commit point anywhere. This limitation can be overcome, however, by allowing programmers to explicitly mark the critical regions within the large transaction where hardware cannot insert commits. If the buffers overflow within these regions, then the processor acquires commit permission early and holds it until the end of the "atomic region," when it finally inserts a commit.

Instead of breaking large transactions into smaller ones, we might also want to have the hardware automatically merge small transactions together into larger ones.

 





 

    

  

Figure 3: The effect of double buffering: (a) a sample transaction timeline; (b) double buffering of all speculative state; (c) double buffering for the write buffer, but not read bits in cache; and (d) pure single buffering.

This would allow us to automatically gain some of the advantages of larger transactions, but at the potential risk of having the hardware merge critical transactions that need to complete and propagate results quickly.

Another useful way that hardware could interact with transactions is to occasionally insert "barriers" into long stretches of unordered transactions. This would simply consist of incrementing the phase number assigned to all new transactions, even if the software does not request a barrier explicitly. These occasional pseudo-barriers would force all currently executing transactions to commit before allowing the system to progress further, effectively forcing all processors to make forward progress. Without this mechanism, there is the possibility of starvation: a long transaction that makes no forward progress because it gets into an infinite loop of violations and restarts.
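The overflow behavior described above could look roughly like the following in C; the functions and the in_atomic_region flag are hypothetical illustrations of hardware state, not a proposed interface.

#include <stdbool.h>

extern bool speculative_buffers_full(void);
extern void commit_and_start_new_transaction(void);   /* hardware-inserted commit point    */
extern void acquire_commit_permission_early(void);    /* held until the atomic region ends */

static bool in_atomic_region;                          /* set by programmer-marked bounds   */

void on_buffered_store(void)
{
    if (!speculative_buffers_full())
        return;
    if (in_atomic_region)
        acquire_commit_permission_early();             /* cannot split: serialize instead   */
    else
        commit_and_start_new_transaction();            /* split the large transaction here  */
}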

Localization of Memory References

One of the assumptions made so far is that all loads and stores must be speculatively buffered and broadcast throughout the system. However, it is often possible for programmers or compilers to give hints to the hardware that could reduce the need for buffering, and especially for broadcast. For example, one way to reduce bandwidth is by marking some loads and stores as "local" ones that do not need to be broadcast.



[Table 1 (flattened by extraction): A summary of applications analyzed for transactional behavior. Columns: Source (SPEC FP [ ], SPLASH-2 [ ], SPECjbb [ ], Java Grande [ ], jBYTEmark [ ], SPECjvm [ ], Java code [ ]); Program (art, equake, swim, tomcatv, lu, radix, water-N-squared, SPECjbb, euler, fft, moldyn, raytrace, four jBYTEmark benchmarks, mtrt, shallow); Summary (image recognition/neural nets, seismic wave simulation, shallow water model, vectorized mesh generation, dense matrix factorization, radix sort, N-body molecular dynamics, transaction processing, flow equations in irregular mesh, FFT kernel, particle modeling, 3D ray tracer, resource allocation, data encryption, neural network, dense matrix factorization, ray tracer, shallow water model); Dataset (numeric sizes lost in extraction); Parallelization (TLS loops with manual fixes, compiler, all manual, or automated TLS-based).]

We applied this optimization to stack references in our analysis, because these references are guaranteed to be local within processors in most parallel systems and do not need to be snooped by other processors. However, there are also other data structures that might be known as "local only" to the programmer and/or compiler. These data structures could be marked either by locating them together in memory pages marked by the OS as "local only," or by accesses using special "load local" and "store local" opcodes. Either would allow the hardware to filter out these local references and, while still marking any changes for speculative rollback if necessary, avoid adding them to the list of data to be broadcast.
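A hedged sketch of the filtering such hints would enable: references to the stack, or to pages an OS has marked "local only," are still tracked for rollback but excluded from the broadcast commit packet. Both predicate functions are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>

extern bool is_stack_address(uintptr_t addr);
extern bool page_marked_local_only(uintptr_t addr);    /* hypothetical OS page attribute */

bool include_in_commit_packet(uintptr_t addr)
{
    if (is_stack_address(addr) || page_marked_local_only(addr))
        return false;           /* local data: buffer for rollback, but do not broadcast */
    return true;                /* shared data: must be snooped by other nodes           */
}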

I/O Handling

A TCC system can handle I/O very easily. The key constraint is that a transaction cannot violate and roll back after input is obtained. When an attempt is made to read input, the current transaction immediately requests commit permission, just as if a buffer had overflowed. The input is only read after commit permission is obtained, when the transaction is guaranteed to never roll back. Outputs that require writes to occur in a specific order (like a network interface) can use a similar "pseudo-overflow" technique to force the writes to propagate out from the processor immediately as stores are made. On the other hand, outputs that can accept potentially reordered writes (such as a frame buffer) may simply be updated at commit time along with normal memory writes, thereby allowing higher performance. As a result, existing I/O handlers will work on TCC systems, although it will probably improve performance if transaction breakpoints are carefully placed within them. "Pseudo-overflows" to end transactions prematurely may also be helpful when events such as system calls and exceptions occur, but this is not necessarily required.
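The input rule above can be captured in a few lines; acquire_commit_permission and device_read stand in for whatever primitives a real system would provide.

#include <stddef.h>

extern void acquire_commit_permission(void);          /* as on a speculative buffer overflow */
extern long device_read(int device, void *buf, size_t len);

long tcc_read_input(int device, void *buf, size_t len)
{
    acquire_commit_permission();                      /* transaction can no longer roll back */
    return device_read(device, buf, len);             /* safe: input is consumed exactly once */
}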


6,08/$7,210(7+2'2/2*< $VWKLVSDSHULVDQLQLWLDOHYDOXDWLRQWRGHWHUPLQHWKHRYHUDOOSR WHQWLDORI7&&ZHFKRVHWRVLPXODWHDYDULHW\RISDUDOOHOLQWHJHU DQGÁRDWLQJSRLQWEHQFKPDUNVXVLQJDVLPSOLÀHGKDUGZDUHPRGHO WKDWLQFOXGHGPDQ\DGMXVWDEOHSDUDPHWHUVWRPRGHODZLGHUDQJH RI SRWHQWLDO LPSOHPHQWDWLRQV RI 7&& V\VWHPV  2XU VHOHFWLRQ RI DSSOLFDWLRQV DQG WKHLU DVVRFLDWHG GDWDVHWV DUH VXPPDUL]HG LQ 7DEOH7KHVHDSSOLFDWLRQVFRPHIURPDZLGHYDULHW\RIGLIIHU HQWDSSOLFDWLRQGRPDLQVKDQGSDUDOOHOL]HG63/$6+>@SUR JUDPVVHYHUDOÁRDWLQJSRLQW63(&DQG>@EHQFKPDUNV SDUDOOHOL]HGVHPLDXWRPDWLFDOO\ZLWKKHOSIURPHLWKHUDFRPSLOHU IRU)RUWUDQ RU7/6 IRU& WKH63(&MEEWUDQVDFWLRQSURFHVVLQJ EHQFKPDUN>@DQGDYDULHW\RI-DYDSURJUDPVSDUDOOHOL]HGXVLQJ DXWRPDWHG7/6WHFKQLTXHV>@ZKLOHUXQQLQJRQWKH.DIIH-90 >@:HSDUDOOHOL]HGWKH63(&MEEEHQFKPDUNZLWKLQRQO\RQH RILWVZDUHKRXVHVDPRUHGLIÀFXOWWDVNWKDQWKHXVXDOWHFKQLTXH RISDUDOOHOL]LQJEHWZHHQZDUHKRXVHVLQRUGHUWRGHPRQVWUDWHKRZ 7&&FDQUHSODFHFRPSOH[ORFNLQJVWUXFWXUHV (DFK RI WKHVH EHQFKPDUNV ZDV UXQ WKURXJK D WKUHHSDUW LQYHV WLJDWLYHSURFHVV 7KH ÀUVW SDUW FRQVLVWHG RI H[DPLQLQJH[LVWLQJ EHQFKPDUNVDQGLQVHUWLQJPDUNHUVDWWKHHQGRIWUDQVDFWLRQVDQG WR UHSODFH FRQYHQWLRQDO LQWHUSURFHVVRU V\QFKURQL]DWLRQ  $IWHU ZDUGVZHUDQWKHDSSOLFDWLRQVRQDQH[HFXWLRQGULYHQVLPXODWRU À[HGWRH[HFXWHDWRQHLQVWUXFWLRQSHUF\FOHZLWKSHUIHFWFDFKH EHKDYLRU ZLWKUHDOFDFKHPLVVHVWKLV,3&ZLOOXVXDOO\DSSUR[L PDWH WKH SHUIRUPDQFH RI DQ DJJUHVVLYH VXSHUVFDODU SURFHVVRU  DQGSURGXFHWUDFHVRIDOOH[HFXWHGORDGVDQGVWRUHV H[FHSWVWDFN UHIHUHQFHVZKLFKZHUHJXDUDQWHHGWREHORFDOWRHDFKSURFHVVRU  LQWKHEHQFKPDUNV2QFHZHKDGREWDLQHGWUDFHVRISDUDOOHOH[ HFXWLRQ IURP RXU VHOHFWLRQ RI EHQFKPDUNV ZH IHG WKHVH WUDFHV LQWR DQ DQDO\]HU WKDW VLPXODWHG WKH HIIHFWV RI UXQQLQJ WKHP LQ SDUDOOHORQDYHU\ÁH[LEOHWUDQVDFWLRQDOV\VWHP2QWKLVV\VWHP ZHZHUHDEOHWRDGMXVWSDUDPHWHUVVXFKDVWKHQXPEHURISURFHV



[Chart residue removed: speedup over one processor versus number of processors, for the manually parallelized benchmarks (art, equake_l, equake_s, lu, radix, SPECjbb, swim, tomcatv, water) and the Java benchmarks (euler, fft, the jbyte benchmarks, moldyn, mtrt, raytrace, shallow).]

)LJXUH6SHHGXSVIRUYDU\LQJQXPEHUVRISURFHVVRUVZLWKRXUPDQXDOO\SDUDOOHOL]HGEHQFKPDUNV D DQG-DYDEHQFKPDUNV ZLWKDXWRPDWHGSDUDOOHOLVP E RQDSHUIHFW7&&V\VWHPZLWK,3&SURFHVVRUVQRPHPRU\GHOD\VDQGÂ&#x2019;FRPPLWEDQGZLGWK VRUVWKHFRPPLWEXVEDQGZLGWKVSHFXODWLYHFDFKHOLQHELWFRQ Ã&#x20AC;JXUDWLRQV DQG WKH RYHUKHDGV DVVRFLDWHG ZLWK YDULRXV SDUWV RI WKH WUDQVDFWLRQDO SURWRFRO  7KH XQXVXDO FKDUDFWHULVWLFV RI 7&& DOORZVXFKDVLPSOHVLPXODWLRQHQYLURQPHQWWRVWLOOJHWUHDVRQ DEOH SHUIRUPDQFH HVWLPDWHV WKH Ã&#x20AC;[HG ,/3 GLG QRW PDWWHU PXFK EHFDXVH7&&LVDWKUHDGOHYHOSDUDOOHOLVPH[WUDFWLRQPHFKDQLVP ODUJHO\RUWKRJRQDOWR,/3H[WUDFWLRQZLWKLQWKHLQGLYLGXDOSURFHV VRUFRUHVDQGWKHIDFWWKDW7&&RQO\DOORZVSURFHVVRULQWHUDFWLRQ DWWUDQVDFWLRQFRPPLWPDGHWKHSUHFLVHWLPLQJRIORDGVDQGVWRUHV ZLWKLQWKHWUDQVDFWLRQVODUJHO\LUUHOHYDQW,QIDFWWKHPRVWFULWLFDO WLPLQJSDUDPHWHUWKDWZHREWDLQHGIURPVLPXODWLRQZDVWKHDS SUR[LPDWHF\FOHWLPHRIHQWLUHWUDQVDFWLRQV$VDUHVXOWZHZHUH DEOHWRVLPXODWHDZLGHYDULHW\RISRWHQWLDOV\VWHPFRQÃ&#x20AC;JXUDWLRQV ZLWK D UHDVRQDEOH DPRXQW RI VLPXODWLRQ WLPH :KLOH PRUH GH WDLOHGVLPXODWLRQVZLOOEHQHFHVVDU\WRLQYHVWLJDWHWKHIXOOSRWHQ WLDORI7&&WKLVUHODWLYHO\VLPSOHVWXG\KDVDOORZHGXVWRVKRZ ZKDWSDUWVRIWKHSDUDOOHOFRPSXWLQJGHVLJQVSDFHDUHDPHQDEOH WR FRQYHUVLRQ WR7&& DQG WR SURYLGH VRPH HVWLPDWHV DV WR WKH EXIIHULQJDQGEDQGZLGWKUHTXLUHPHQWVWKDWZLOOEHQHFHVVDU\IRU KDUGZDUHVXSSRUWRI7&&

6,08/$7,215(68/76 /LPLWVRI$YDLODEOH3DUDOOHOLVP 2XU Ã&#x20AC;UVW UHVXOWV VKRZ WKH OLPLWV RI SDUDOOHOLVP DYDLODEOH LQ RXU EHQFKPDUNV WKDW FDQ EH H[WUDFWHG ZLWK D 7&& V\VWHP  )LJ  VKRZV WKH VSHHGXSV WKDW FDQ EH REWDLQHG RQ WKHVH DSSOLFDWLRQV DVWKHQXPEHURISURFHVVRUVYDULHVIURPWRLQDQ´RSWLPDOµ 7&&V\VWHP7KLVSHUIHFWV\VWHPKDVLQÃ&#x20AC;QLWHFRPPLWEXVEDQG ZLGWKEHWZHHQWKHSURFHVVRUV7KHVSHHGXSVDFKLHYHGZLWKVHY HUDOEHQFKPDUNVDUHFORVHWRWKHRSWLPDOOLQHDUFDVHDQGPDQ\ RWKHUVDUHFRPSHWLWLYHZKHQFRPSDUHGZLWKSUHYLRXVO\SXEOLVKHG UHVXOWVREWDLQHGRQFRQYHQWLRQDOV\VWHPVLQSDSHUVVXFKDV>@

$VLVLOOXVWUDWHGLQ)LJWKHUHDUHVHYHUDOUHDVRQVZK\VSHHGXSV DUHOLPLWHG6HTXHQWLDOFRGHUHPDLQLQJLQVHYHUDORIWKHDSSOLFD WLRQVOLPLWVVSHHGXSWKURXJK$PGDKO·VODZ ´LGOHµWLPH /RDG LPEDODQFHLQSDUDOOHOUHJLRQVVORZVGRZQZDWHUDQGIIW ´ZDLW LQJµWLPH )LQDOO\ZKLOHZHZHUHJHQHUDOO\YHU\VXFFHVVIXODW HOLPLQDWLQJGHSHQGHQFLHVDPRQJWUDQVDFWLRQVVRPHDSSOLFDWLRQV VWLOOVXIIHUIURPRFFDVLRQDOYLRODWLRQVFDXVHGE\WUXHLQWHUWUDQV DFWLRQ GHSHQGHQFLHV UHPDLQLQJ LQ WKH SURJUDPV  )RU H[DPSOH LQ63(&MEEZHHOLPLQDWHGDOOORFNVSURWHFWLQJYDULRXVSDUWVRI WKHZDUHKRXVHGDWDEDVHV$VORQJDVPXOWLSOHWUDQVDFWLRQVGRQRW PRGLI\WKHVDPHREMHFWVVLPXOWDQHRXVO\WKH\PD\UXQLQSDUDOOHO EXWVLQFHZHKDYHPXOWLSOHSURFHVVRUVUXQQLQJZLWKLQWKHVDPH ZDUHKRXVHWKHUHLVDOZD\VDSUREDELOLW\WKDWVLPXOWDQHRXVPRGL Ã&#x20AC;FDWLRQVPD\FDXVHRQHRIWKHWUDQVDFWLRQVWRYLRODWH $VDIXUWKHUH[DPSOHRIWKLVZHVKRZUHVXOWVIRUHTXDNHSDUDOOHO L]HGLQWRYHUVLRQVZLWKERWKORQJ HTXDNHBO DQGVKRUW HTXDNHB V WUDQVDFWLRQV7KHORQJHUWUDQVDFWLRQVWHQGHGWRLQFXUPRUHYLR ODWLRQVDQGH[SHULHQFHGPXFKOHVVVSHHGXS2QEHQFKPDUNVOLNH WKLV WKH SRVLWLRQLQJ DQG IUHTXHQF\ RI WUDQVDFWLRQ FRPPLWV FDQ FOHDUO\EHFULWLFDO:HDOVRSDUDOOHOL]HGVHYHUDOYHUVLRQVRIUD GL[WRKDYHGLIIHUHQWWUDQVDFWLRQVL]HV+RZHYHUUDGL[KDVEHHQ PDQXDOO\WXQHGWRHOLPLQDWHGHSHQGHQFLHVDQGORDGLPEDODQFHEH WZHHQSURFHVVRUVVREDVHOLQHVSHHGXSVFKDQJHGYHU\OLWWOHDFURVV WKHYDULRXVYHUVLRQV

%XIIHULQJ5HTXLUHPHQWVIRU7\SLFDO 7UDQVDFWLRQV 7KH PRVW VLJQLÃ&#x20AC;FDQW KDUGZDUH FRVW RI D 7&& V\VWHP LV LQ WKH DGGLWLRQRIWKHVSHFXODWLYHEXIIHUVXSSRUWWRWKHORFDOFDFKHKL HUDUFK\ $V D UHVXOW LW LV FULWLFDO WKDW WKH DPRXQW RI VWDWH UHDG DQGRUZULWWHQE\DQDYHUDJHWUDQVDFWLRQEHVPDOOHQRXJKWREH EXIIHUHGRQFKLS7RJHWDQLGHDDERXWWKHVL]HRIWKHVWDWHVWRUDJH



[Chart residue removed: per-benchmark processor activity breakdown (Used, Waiting, Violating, Idle).]

Figure: Distribution of execution time on the perfect TCC system's processors between useful work, violated time (failed transactions), waiting time (load imbalance in parallel code), and idle time (time waiting during sequential code).



  













 



   





[Chart residue removed: read state per application, in KB with [ ]-byte lines; x-axis lists the applications.]
Figure: State read by individual transactions, with store buffer granularity of [ ]-byte cache lines. We show the state required by the smallest [ ]% and [ ]% of iterations.

  











 



 

 



  

[Chart residue removed: write state per application, in KB with [ ]-byte lines; x-axis lists the applications. Figure caption: Same as the previous figure, but for write state.]

6WDWHVL]HLVPRVWO\GHSHQGHQWXSRQWKHVL]HVRI´QDWXUDOµWUDQV DFWLRQDOFRGHUHJLRQVVXFKDVORRSERGLHVWKDWDUHDYDLODEOHIRU H[SORLWDWLRQZLWKLQDQDSSOLFDWLRQ$VVXFKLWLVYHU\DSSOLFDWLRQ GHSHQGHQWEXWJHQHUDOO\TXLWHUHDVRQDEOH:LWKWKHH[FHSWLRQRI PWUW DQG 63(&MEE DOO RI RXU EHQFKPDUNV ZRUNHG ÀQH ZLWKLQ DERXW.%RIUHDGVWDWH³ZHOOZLWKLQWKHVL]HRIHYHQWKH VPDOOHVWFDFKHVWRGD\³DQGDERXW.%RIZULWHVWDWH7KH EXIIHUKXQJU\DSSOLFDWLRQVJHQHUDOO\VWLOOKDGORZDQG EUHDNSRLQWVVRHYHQWKRVHZRXOGSUREDEO\ZRUNUHDVRQDEO\ZHOO ZLWKVPDOOEXIIHUVDOWKRXJKQRWLFHDEOHVHULDOL]DWLRQIURPEXIIHU RYHUÁRZZRXOGXQGRXEWHGO\RFFXU:KLOHRXUYDULRXVYHUVLRQV RIUDGL[GLGQRWYDU\PXFKLQWHUPVRIVSHHGXSWKH\YDULHGGUD PDWLFDOO\LQWKHVL]HRIWKHLUUHDGDQGZULWHVWDWH2XUUDGL[BODQG UDGL[B[O QRWSORWWHGEHFDXVHLWZDVVRODUJH YDULDWLRQVUHTXLUHG YHU\ ODUJH DPRXQWV RI VWDWH ZLWK HDFK WUDQVDFWLRQ  +RZHYHU LW ZDV UHODWLYHO\ HDV\ WR VFDOH WKHVH GRZQ WR VPDOOHU WUDQVDFWLRQV ZLWKOLWWOHLPSDFWRQWKHV\VWHPSHUIRUPDQFH%DVHGRQRXUH[ DPLQDWLRQ RI WKH FRGH PDQ\ GHQVHPDWUL[ DSSOLFDWLRQV VXFK DV VZLPDQGWRPFDWYVKRXOGKDYHVLPLODUSURSHUWLHV$Q\RIWKHVH ´WUDQVDFWLRQ VL]H WROHUDQWµ DSSOLFDWLRQV ZRXOG DOVR EH H[FHOOHQW WDUJHWV IRU XVH ZLWK KDUGZDUH FRPPLW FRQWURO DV GHVFULEHG LQ 6HFWLRQ  ZKLFK FRXOG KHOS WKH SURJUDPPHU VL]H WUDQVDFWLRQ UHJLRQVRSWLPDOO\IRUWKHDYDLODEOHEXIIHUVL]HV7KLVZRXOGEHHV SHFLDOO\KHOSIXOLIZLGHO\YDU\LQJGDWDVHWVPD\EHXVHGDVWUDQV DFWLRQVWKDWHQWLUHO\FRQWDLQLQQHUORRSVPD\YDU\LQVL]HDORQJ ZLWKWKHGDWDVHW




UHTXLUHG)LJVDQGVKRZWKHVL]HRIWKHEXIIHUVQHHGHGWRKROG WKHVWDWHUHDGRUZULWWHQE\DQGRIHDFKDSSOLFD WLRQ·VWUDQVDFWLRQVVRUWHGE\WKHVL]HRIWKHOLPLW9LUWXDOO\ DOODSSOLFDWLRQVKDYHDIHZYHU\ODUJHWUDQVDFWLRQVWKDWZLOOGHÀ QLWHO\FDXVHRYHUÁRZEXWKDUGZDUHVKRXOGKDYHHQRXJKURRPWR DYRLGRYHUÁRZRQPRVWWUDQVDFWLRQVLQRUGHUWRNHHSWKHQXPEHU RIHDUO\FRPPLWSHUPLVVLRQFODLPVWRDPLQLPXPRUEHWWHU LVDJRRGLQLWLDOWDUJHWEXWHYHQIHZHURYHUÁRZVPD\EHQHFHV VDU\IRUJRRGSHUIRUPDQFHRQV\VWHPVZLWKPDQ\SURFHVVRUV


    

[Chart residue removed: addresses broadcast per cycle for each application.]

)RU RXU ´SHUIHFWµ VDPSOH V\VWHP )LJ  VKRZV WKH DYHUDJH QXPEHU RI DGGUHVVHV WKDW PXVWEHEURDGFDVWRQHYHU\F\FOHLQRUGHUWR FRPPLWDOOZULWHVWDWHSURGXFHGE\DOOWUDQV DFWLRQVLQDV\VWHPZKHQWKHVWDWHLVVWRUHG DVE\WHFDFKHOLQHV:KLOHEXVDFWLYLW\LQ D7&&V\VWHPLVOLNHO\WREHEXUVW\WKHDYHU DJHEDQGZLGWKVDUHXVHIXOPHDVXUHVEHFDXVH RIWKHHDVHZLWKZKLFK7&&FRPPLWSDFNHWV PD\EHEXIIHUHGDVZDVGHVFULEHGLQ6HFWLRQ %HFDXVHWKHUHDUHQRGHOD\VLQRXUV\V WHPIRUFDFKHPLVVHVRUFRPPXQLFDWLRQFRQ WHQWLRQWKHVHVKRXOGEHFRQVLGHUHGDVDQXS SHUERXQGIRULQVWUXFWLRQVWUHDPVDYHUDJLQJ ,3&7KHVHQXPEHUVFDQEHVFDOHGXSWRLQ GLFDWHSRWHQWLDOPD[LPXPVIRU7&&V\VWHPV FRPSRVHGRIZLGHLVVXHVXSHUVFDODUFRUHVRU GRZQIRUVLPSOHSURFHVVRUV

Limited Bus Bandwidth
Figure: Number of [ ]-byte cache line addresses broadcast per cycle in our perfect system with [ ] IPC, infinite bus bandwidth, and no cache misses. This indicates essentially the maximum snoop rate per average processor IPC that could be expected.

  

[Chart residue removed: bytes broadcast per cycle for each application, split into whole lines versus dirty data only.]

Figure: Average bytes per cycle broadcast by a [ ]-IPC system with infinite bus bandwidth, no cache misses, and TCC with an update protocol; [ ] bytes of "address overhead" per [ ]-byte cache line are assumed.

  

[Chart residue removed: bytes broadcast per cycle for each application under a write-through-based mechanism.]

)RUDOORIRXUDSSOLFDWLRQVWKHQXPEHURIDG GUHVVHVSHUF\FOHLVZHOOEHORZRQHVRDVLQJOH VQRRSSRUWRQHYHU\SURFHVVRUQRGHVKRXOGEH VXIÃ&#x20AC;FLHQWIRUGHVLJQVRIXSWRSURFHVVRUV DQGFDQSUREDEO\VFDOHXSWRDERXWVLPSOH SURFHVVRUV RU D VPDOOHU QXPEHU RI ZLGHLV VXH VXSHUVFDODU SURFHVVRUV EHIRUH DGGLWLRQDO VQRRS EDQGZLGWK ZRXOG EH UHTXLUHG 7KHVH UHVXOWVDOVRLQGLFDWHWKDWVPDOO7&&V\VWHPV XVLQJ DQ LQYDOLGDWH SURWRFRO ZRXOG XVXDOO\ SURGXFHOHVVWKDQDERXWE\WHVF\FOHZLWK ELWDGGUHVVHV2QWKHRWKHUKDQGLIDQXS GDWHSURWRFROLVXVHGWKHQWKHDPRXQWRIGDWD EHLQJEURDGFDVWPD\VWLOOEHSURGLJLRXVDVLV VKRZQLQ)LJ2QVRPHRIWKHDSSOLFDWLRQV DERXW D KDOI RI WKLV VDPSOH  ZH PD\ HYHQ EH EURDGFDVWLQJ PRUH GDWD WKDQ D SURFHVVRU H[HFXWLQJDZULWHWKURXJKEDVHGFDFKHFRKHU HQF\ PHFKDQLVP DV LOOXVWUDWHG LQ )LJ  ZLWKDKLJKRIQHDUO\E\WHVSHUF\FOHIURP WKHYHUVLRQVRIUDGL[ZLWKVPDOOWUDQVDFWLRQV DQGWKHUHIRUHPRUHIUHTXHQWFRPPLWV IRU SURFHVVRUV\VWHPV:KLOH7&&DOORZVZULWHV WREHFRPELQHGWRJHWKHULQWREXIIHUHGFDFKH OLQHVRYHUWKHFRXUVHRIDWUDQVDFWLRQWKHFRP PLWWLQJRIH[WUD´FOHDQµVHFWLRQVRISDUWLDOO\ PRGLÃ&#x20AC;HGOLQHVLQWKHZULWHVWDWHFDQSXVKXS WKHRYHUDOOEDQGZLGWKUHTXLUHPHQWVGUDPDWL FDOO\7KLVSUREOHPFDQEHDOPRVWFRPSOHWHO\ RYHUFRPHE\PRGLI\LQJWKHFRPPLWEURDGFDVW XQLWWRRQO\VHQGRXWPRGLÃ&#x20AC;HGSDUWVRIFRP PLWWLQJFDFKHOLQHVOLPLWLQJEDQGZLGWKWRMXVW WKHEODFNSDUWRIWKHEDUVLQ)LJDQGOLPLWLQJ WKHDPRXQWRIEURDGFDVWEDQGZLGWKUHTXLUHG WRDERXWE\WHVSHUF\FOHHYHQRQWKHZRUVW FDVHDSSOLFDWLRQVOLNHOXVZLPDQGWRPFDWY )RUPRUHW\SLFDODSSOLFDWLRQVDUDQJHRI² E\WHVSHUF\FOHZRXOGEHVXIÃ&#x20AC;FLHQW


Figure: For comparison, average bytes per cycle broadcast on a system using a write-through-based mechanism (address + data = [ ] bytes per write).





:KLOHWKHSUHYLRXVUXQVZLWK´SHUIHFWµKDUG ZDUH DUH KHOSIXO IRU GHWHUPLQLQJ LI 7&& LV D YLDEOH LGHD WKH\ GR QRW VKRZ KRZ D UHDO 7&& V\VWHP ZLOO ZRUN LQ SUDFWLFH ZKHUH LVVXHV OLNH Ã&#x20AC;QLWH EXV EDQGZLGWKV UHGXFHG QXPEHUVRIUHDGVWDWHELWVOLPLWHGEXIIHULQJ DQGWLPHWRKDQGOHWKHYDULRXVSURWRFRORYHU KHDGV FDQ DOO EH VLJQLÃ&#x20AC;FDQW OLPLWLQJ IDFWRUV RQVSHHGXS7KLVVHFWLRQDWWHPSWVWRORRNDW DIHZRIWKHVHLVVXHVE\YDU\LQJVRPHRIWKH SDUDPHWHUVZLWKDQSURFHVVRUV\VWHP

  

[Chart residue removed: speedup of each application at several finite commit bus bandwidths, including the infinite-bandwidth baseline.]

Figure: The effect of finite commit bus bandwidth (in bytes per cycle) on the speedup of [ ]-processor systems writing [ ]-byte (header plus data) cache lines during commit, at several cycles per committed line.

  


:HVLPXODWHGÃ&#x20AC;QLWHEXVEDQGZLGWKVUDQJLQJ IURPYHU\KLJK E\WHVF\FOHDIXOOFDFKH OLQHFRPPLWWHGSHUF\FOH WROHYHOVWKDWZRXOG EHUHDVRQDEOHLQDKLJKSHUIRUPDQFH&03RU HYHQ D SRWHQWLDOO\ ERDUGOHYHO V\VWHP ZLWK D KLJKSHUIRUPDQFH LQWHUFRQQHFW DQG SUHV HQWWKHUHVXOWVLQ)LJ0RVWDSSOLFDWLRQV ZHUHUHODWLYHO\LQVHQVLWLYHWRWKHVHOHYHOVRI EDQGZLGWKOLPLWVEXWDIHZWKDWKDGDODUJH ZULWHVWDWHDQGUHODWLYHO\VKRUWWUDQVDFWLRQV QRWDEO\ IIW H[SHULHQFHG VRPH GHJUDGDWLRQ /DUJHU QXPEHUV RI SURFHVVRUV RU DQ HYHQ PRUHFRQVWUDLQHGLQWHUFRQQHFWDUHQHFHVVDU\ IRU EDQGZLGWK WR EHFRPH D PDMRU OLPLWLQJ IDFWRUIRU7&&V\VWHPV

Other Limited Hardware


    

    

    

    

    

    

    

    

    

    

6SHHGXS

,QDGGLWLRQZHWULHGPDNLQJWKHWLPLQJRYHU  KHDG UHTXLUHG IRU FRPPLW SHUPLVVLRQ DUEL  WUDWLRQ QRQ]HUR ZLWK WLPHV UDQJLQJ IURP   F\FOHV QHFHVVDU\ IRU DUELWUDWLRQ DFURVV D ODUJHFKLS WRF\FOHV ZKLFKPD\EHQHF  HVVDU\ RQ D ODUJHU ERDUGVL]H V\VWHP  DQG  SUHVHQW WKHVH UHVXOWV LQ )LJ   63(&MEE 63/$6+ DSSOLFDWLRQV DQG FRPSLOHUSDUDO HXOHU IIW ME\WHB% ME\WHB% ME\WHB% ME\WHB% PROG\Q PWUW UD\WUDFH VKDOORZ OHOL]HG 63(& )3 DSSOLFDWLRQV ZKLFK KDYH EHHQ GHVLJQHG IRU XVH RQ ODUJH V\VWHPV )LJXUH7KHHIIHFWRQVSHHGXSRILQFUHDVLQJFRPPLWDUELWUDWLRQWLPH  ZHUHDOPRVWWRWDOO\LQVHQVLWLYHWRWKLVIDFWRU DQGF\FOHV IRUDV\VWHPZLWK&38VDQGDZLGHE\WHEXV 7/6GHULYHGDSSOLFDWLRQVRQWKHRWKHUKDQG VXUSULVLQJO\HQRXJK:KHQZHWXUQHGLWRIIQRWPXFKKDSSHQHG ZHUHRIWHQTXLWHVHQVLWLYHDVWKHLUWUDQVDFWLRQVWHQGHGWREHPXFK +RZHYHUWKHVHWHVWVZHUHSHUIRUPHGZLWKUHODWLYHO\SOHQWLIXOV\V VPDOOHU6LPLODUO\WKHYHUVLRQVRIHTXDNHDQGUDGL[WKDWKDGWKH WHPEDQGZLGWK6LQFHGRXEOHEXIIHULQJLVSULPDULO\DWHFKQLTXH VPDOOHVWWUDQVDFWLRQVL]HVVKRZHGPXFKPRUHGHJUDGDWLRQIURP WRDYRLGZDLWLQJIRUDEXV\EURDGFDVWPHGLXPLWVKRXOGVWLOOSURYH WKLVRYHUKHDGWKDQWKHYHUVLRQVZLWKORQJHUWUDQVDFWLRQV WREHXVHIXOLQPRUHEDQGZLGWKOLPLWHGHQYLURQPHQWV :H IRXQG WKDW RSWLRQDO VWDWH GHVFULEHG LQ 6HFWLRQV  DQG  SURYHG WR EH OHVV XVHIXO ZLWK RXU VHOHFWLRQ RI DSSOLFDWLRQV DQG )8785(:25. KDUGZDUHSDUDPHWHUV([WUDUHDGVWDWHELWV RQDSHUZRUGLQVWHDG 7KLVLQLWLDOLQYHVWLJDWLRQRI7&&VXJJHVWVPDQ\SRWHQWLDOGLUHF RISHUOLQHEDVLV XVXDOO\PDGHQRGLIIHUHQFHEXWZHUHHVVHQWLDO WLRQVIRUIXWXUHZRUN7KHPRVWFULWLFDOLVDQHYDOXDWLRQRI7&& ZLWKDIHZDSSOLFDWLRQV0RVWRIRXUPDQXDOO\SDUDOOHOL]HGDS ZLWK UHDOLVWLF KDUGZDUH PRGHOV IRU D &03 DQGRU D ERDUGOHYHO SOLFDWLRQVZHUHFDUHIXOO\WXQHGWRDYRLGWKH´IDOVHYLRODWLRQVµDV V\VWHP$GHWDLOHGHYDOXDWLRQRIWKH7&&SURJUDPPLQJHQYLURQ WKH\ZHUHDOUHDG\EORFNHGWRDYRLGIDOVHFDFKHVKDULQJEXWVRPH PHQWLVDOVRDSULRULW\VLQFHRQHRIWKHPDLQDGYDQWDJHVRI7&& RI WKH7/6SDUDOOHOL]HG DSSOLFDWLRQV ZKRVH GDWD VWUXFWXUHV KDG LV LWV VLPSOLÃ&#x20AC;HG SDUDOOHO SURJUDPPLQJ PRGHO  )XUWKHU RXW ZH QRWEHHQPRGLÃ&#x20AC;HGIRUSDUDOOHOLVPZHUHGHSHQGHQWXSRQKDUGZDUH VHH7&&EHLQJH[WHQGHGWREHPRUHVFDODEOHE\LPSRVLQJOHYHOV WRDYRLGH[WUDQHRXVYLRODWLRQV0HPRU\UHQDPLQJELWVZHUHRQO\ RI KLHUDUFK\ RQ WKH FRPPLW DUELWUDWLRQ DQG VQRRS PHFKDQLVPV FULWLFDOIRUWZRRIWKH-DYD7/6DSSOLFDWLRQVME\WHB%DQGPWUW DQGSRVVLEO\E\DOORZLQJVRPHRYHUODSEHWZHHQFRPPLWV0RUH DVWKH\UHXVHGVRPH´VFUDWFKSDGµGDWDVWUXFWXUHVLQHDFKWUDQVDF IXQFWLRQDOLW\ PD\ DOVR EH DGGHG VXFK DV WKH KDUGZDUH FRPPLW WLRQ2XUDQDO\VLVDOVRVKRZHGOLWWOHJDLQIURPGRXEOHEXIIHULQJ


PHFKDQLVPVGHVFULEHGLQ6HFWLRQH[WHQVLRQVWRWKHGDWDOR FDOL]DWLRQPHQWLRQHGLQ6HFWLRQRUV\VWHPUHOLDELOLW\PHFKD QLVPVWKDWXVH7&&·VFRQWLQXRXVVSHFXODWLYHWUDQVDFWLRQVWRUROO EDFNWKHFXUUHQWWUDQVDFWLRQDIWHUWUDQVLHQWIDXOWV

CONCLUSIONS

We have analyzed a variety of implementations of TCC systems, including an optimal one, and determined that TCC can be used to obtain good performance over a wide variety of existing parallel application domains, while providing a programming model that significantly simplifies the task of writing parallel programs. Our analysis of TCC with a wide range of applications shows that each processor node requires several kilobytes of read buffering space in its caches and a few kilobytes of write buffering to achieve high-performance execution on most applications. This buffer memory adds little overhead to the existing cache hierarchy already present within the node. The main limitation of TCC is that it requires high broadcast bandwidth among the processor nodes to maintain all processors' memory in a coherent state. For the system sizes we studied, the interprocessor interconnect bandwidth must be large enough to sustain a few bytes per cycle per average processor IPC to support an update protocol, or considerably less for an invalidate protocol. These rates are easy to sustain within a CMP, and perhaps even a single-board multiprocessor. On these types of systems, we believe that TCC could be a high-performance but much simpler alternative to traditional cache coherence and consistency.

ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their valuable feedback. This work was supported by an NSF CCR grant and by DARPA PCA program grants.

5()(5(1&(6 >@



COVER FEATURE

Overcoming CAP with Consistent Soft-State Replication

Kenneth P. Birman, Daniel A. Freedman, Qi Huang, and Patrick Dowell, Cornell University

New data-consistency models make it possible for cloud computing developers to replicate soft state without encountering the limitations associated with the CAP theorem.

The CAP theorem explores tradeoffs between consistency, availability, and partition tolerance, and concludes that a replicated service can have just two of these three properties.1,2 To prove CAP, researchers construct a scenario in which a replicated service is forced to respond to conflicting requests during a wide-area network outage, as might occur if two different datacenters hosted replicas of some single service, and received updates at a time when the network link between them was down. The replicas respond without discovering the conflict, resulting in inconsistency that might confuse an end user.

However, there are important situations in which cloud computing developers depend upon data or service replication, and for which this particular proof does not seem to apply. Here, we consider such a case: a scalable service running in the first tier of a single datacenter. Today’s datacenters employ redundant networks that almost never experience partitioning failures: the “P” in CAP does not occur. Nonetheless, many cloud computing application developers believe in a generalized CAP “folk theorem,” holding that scalability and elasticity are incompatible with strong forms of consistency.


Our work explores a new consistency model for data replication in first-tier cloud services. The model combines agreement on update ordering with a form of durability that we call amnesia freedom. Our experiments confirm that this approach scales and performs surprisingly well.

THE CAP THEOREM’S BROAD REACH

The CAP theorem has been highly influential within the cloud computing community, and is widely cited as a justification for building cloud services with weak consistency or assurance properties. CAP’s impact has been especially important in the first-tier settings on which we focus in this article. Many of today’s developers believe that CAP precludes consistency in first-tier services. For example, eBay has proposed BASE (Basically Available replicated Soft state with Eventual consistency), a development methodology in which services that run in a single datacenter on a reliable network are deliberately engineered to use potentially stale or incorrect data, rejecting synchronization in favor of faster response, but running the risk of inconsistencies.3 Researchers at Amazon.com have also adopted BASE. They point to the self-repair mechanisms in the Dynamo key-value store as an example of how eventual consistency behaves in practice.4

Inconsistencies that occur in eBay and Amazon cloud applications can often be masked so that users will not notice them. The same can be said for many of today’s most popular cloud computing uses: how much consistency is really needed by YouTube or to support Web searches? However, as applications with stronger assurance needs migrate to the cloud, even minor inconsistencies could endanger users. For example, there has been considerable interest in creating cloud computing solutions for medical records management or control of the electric power grid. Does CAP represent a barrier to building such applications, or can stronger properties be achieved in the cloud?

At first glance, it might seem obvious that the cloud can provide consistency. Many cloud applications and products offer strong consistency guarantees, including databases and scalable global file systems. But these products do not run in the first tier of the cloud—the client-facing layer that handles incoming requests from browsers or responds to Web services method invocations. There is a perception that strong consistency is not feasible in these kinds of highly elastic and scalable services, which soak up much of the workload.

Here, we consider first-tier applications that replicate data—either directly, through a library that offers a key-value storage API, or with some form of caching. We ignore application details, and instead look closely at the multicast protocols used to update replicated data. Our work confirms that full-fledged atomic multicast probably does not scale well enough for use in this setting (a finding that rules out using durable versions of Paxos or ACID transactions to replicate data), and in this sense we agree with the generalization of CAP. However, we also advocate other consistency options that scale far better: the amnesia-free protocols. By using them, developers can overcome the limitations associated with CAP.

THE FIRST-TIER PROGRAMMING MODEL

Cloud computing systems are generally structured into tiers: a first tier that handles incoming client requests (from browsers, applications using Web services standards, and so on), caches and key-value stores that run near the first tier, and inner-tier services that provide database and file system functionality. Back-end applications work offline, preparing indices and other data for later use by online services.

Developers typically tune applications running in the first tier for rapid response, elasticity, and scalability by exploiting a mixture of aggressive replication and very loose coupling between replicas. A load-balancer directs requests from client systems to a service instance, which runs the developer-provided logic. To minimize delay, an instance will compute as locally as possible, using cached data if available, and launching updates in the background (asynchronously). In support of this model, as little synchronization is done as possible: first-tier services are nontransactional and run optimistically, without obtaining locks or checking that cached data is still valid. Updates are applied optimistically to cached data, but the definitive outcome will not occur until the real update is finally done later,

on the underlying data; if this yields a different result, the system will often simply ignore the inconsistency. While first-tier services can query inner services, rather than focusing on that pattern, we consider reads and updates that occur entirely within the first tier, either on data replicated by first-tier service instances or within the key-value stores and caches supporting them. Modern cloud development platforms standardize this model, and in fact take it even further. First-tier applications are also required to be stateless: to support rapid instance launch or shutdown, cloud platforms launch each new instance in a standard initial state, and discard any local data when an instance fails or is halted by the management infrastructure. “Stateless” doesn’t mean that these instances have no local data at all, but rather that they are limited to a soft state that won’t be retained across

instance failure/launch sequences. On launch, such state is always clean; a new instance initializes itself by copying data from some operational instance or by querying services residing deeper in the cloud. Even an elastic service will not be completely shut down without warning: to ensure continuous availability, cloud management platforms can be told to keep some minimum number of replicas of each service running. Thus, these services vary the number of extra replicas for elasticity, but preserve a basic level of availability. Jointly, these observations enable a form of weak durability. Data replicated within the soft state of a service, in members that the management platform will not shut down (because they reside within the core replica set), will remain available unless a serious failure causes all the replicas to crash simultaneously, a rare occurrence in a well-designed application.

Today’s first-tier applications often use stale or otherwise inconsistent data when responding to requests because the delays associated with fetching (or even validating) data are perceived as too high. Even slight delays can drive users away. Web applications try to hide resulting problems or to clean them up later (eventual consistency), possibly ending up in a state at odds with what the client saw. Cloud owners justify these architectural decisions by asserting a generalized CAP principle, arguing that consistency is simply incompatible with scalability and rapid responsiveness.
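To make the loose-coupling pattern concrete, the following is a minimal sketch of a first-tier handler that answers from local soft state and pushes the authoritative update to the inner tier asynchronously. It is not code from the article; the cache, queue, and inner-tier callback names are illustrative assumptions.

```python
import threading
import queue

# Illustrative soft-state cache and a background queue for asynchronous
# inner-tier updates; both names are assumptions made for this sketch.
cache = {}
pending_updates = queue.Queue()

def handle_request(key, new_value=None):
    """First-tier handler: respond from local soft state, update lazily."""
    if new_value is not None:
        # Apply the update optimistically to the local cache ...
        cache[key] = new_value
        # ... and enqueue the authoritative update for the inner tier,
        # keeping it off the critical path of the response.
        pending_updates.put((key, new_value))
    # Respond immediately from (possibly stale) cached data.
    return cache.get(key)

def background_writer(apply_to_inner_tier):
    """Drains queued updates asynchronously on a daemon thread."""
    while True:
        key, value = pending_updates.get()
        apply_to_inner_tier(key, value)

threading.Thread(
    target=background_writer,
    args=(lambda k, v: None,),  # stand-in for the real inner-tier call
    daemon=True,
).start()

print(handle_request("cart:42", new_value=["book"]))  # -> ['book']
```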


Adopting a consistency model better matched to first-tier characteristics permits similar responsiveness while delivering stronger consistency guarantees. This approach could enable cloud-hosting of applications that need to justify the responses they provide to users, such as medical systems that monitor patients and control devices. Consistency can also enhance security: a security system that bases authorization decisions on potentially stale or incorrect underlying data is at risk of mistakes that a system using consistent data will not make.

CONSISTENCY: A MULTIDIMENSIONAL PROPERTY

Developers can define consistency in many ways. Prior work on CAP defined “C” by citing the ACID (atomicity, consistency, isolation, and durability) database model; in this approach the “C” in CAP is defined to be the “C” and “D” from ACID. Consistency, in effect, is conflated with durability. Underscoring this point, several CAP and BASE articles cite Paxos as the usual way to implement consistent replication: an atomic multicast protocol that provides total ordering and durability. Durability is the guarantee that once an update has been ordered and marked as deliverable, it will never be lost, even if the entire service crashes and then restarts.

But for a first-tier service, durability in this direct sense is clearly not feasible: any state that can endure such a failure would necessarily be stored outside the service itself. Thus, by focusing on mechanisms that expend effort to guarantee durability, CAP researchers arrive at conclusions that are not applicable in first-tier services, which are limited to soft state.

Membership

Any replication scheme needs a membership model. Consider some piece of replicated data in a first-tier service: the data might be replicated across the full set of first-tier application instances or it might reside just within some small subset of them (in the latter case the term shard is often used). Which nodes are supposed to participate? For the case in which every replica has a copy of the data item, the answer is evident: all the replicas currently running. But because cloud platforms vary this set elastically, the actual collection will change over time, perhaps rapidly. Full replication necessitates tracking the set,
having a policy for initializing a newly launched service instance, and ensuring that each update reaches all the replicas, even if that set is quite large. For sharded data, any given item will be replicated at just a few members, hence a mapping from key (item-id) to shard is needed. Since each service instance belongs to just a few shards but potentially needs access to all of them, a mechanism is also needed so that any instance can issue read or update requests to any shard. Moreover, since shard membership will vary elastically (but without completely shutting down the full membership of any shard), membership dynamics must be factored into the model. One way to handle such issues is seen in Amazon’s Dynamo key-value store,5 which is a form of distributed hash table (DHT). Each node in Dynamo is mapped (using a hashing function) to a location on a virtual ring, and the key associated with each item is similarly mapped to the ring. The closest node with a mapped id less than or equal to that of the item is designated as its primary owner, and the value is replicated to the primary and to the next few (typically three) nodes along the ring: the shard for that key. Shard mappings change as nodes join and leave the ring, and data is moved around accordingly (a form of state transfer). Amazon apparently coordinates this with the cloud management service to minimize abrupt elasticity decisions that would shut down entire shards faster than the members can transfer state to new owners. A second way to implement shards occurs in systems that work with process groups:6 here, a group communication infrastructure such as our new Isis2 system solves the various requirements. Systems of this sort offer an API that provides much of the needed functionality: ways for processes to create, join, and leave groups; group names that might encode a key (such as “shard123”); a state transfer mechanism to initialize a joining member from the state of members already active; and other synchronization features. Isis2, for example, implements the virtual synchrony model.7 The developer decides how shards should work, then uses the provided API to implement the desired policy over the group infrastructure tools. Again, coordination with the cloud management infrastructure is required to avoid disruptive elasticity events. Services that run on a stable set of nodes for a long period employ a third shard-implementation option that enables a kind of static membership in which some set of nodes is designated as running the service. Here, membership remains fixed, but some nodes might be down when a request is issued. This forces the use of quorum replication schemes, in which only a quorum of replicas see each update, but reading data requires accessing multiple replicas; state transfer is not needed unless the static membership must be reconfigured—a rare and costly operation. Several CAP articles refer to the high cost of quorum operations, suggesting that many in the
community have had direct experience with this category of solutions, which includes the most widely used Paxos libraries. Notice that quorum operations are needed in this static membership case, but not in the others: the DHT and process group approaches avoid the need for quorums because they evolve shard membership as nodes join, leave, or fail. This matters because quorum operations residing on the critical path for a first-tier request introduce nonlocal delays, which cloud developers are intent upon minimizing in the interests of rapid response and scalability. With dynamic shard membership both reads and writes can be done locally. The multicasts that update remote replicas occur asynchronously, in parallel with computation of the response that will be sent to the client.
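The ring-style key-to-shard mapping described above can be sketched as follows. This is an illustration of the general technique, assuming a fixed node list and a shard size of three; Dynamo's actual placement rules, virtual nodes, and rebalancing logic are omitted.

```python
import hashlib
from bisect import bisect_right

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICAS_PER_SHARD = 3  # primary owner plus the next few nodes on the ring

def ring_position(name: str) -> int:
    """Hash a node id or item key onto a virtual ring (illustrative)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

# Place the nodes on the ring once, sorted by ring position.
ring = sorted((ring_position(n), n) for n in NODES)

def shard_for(key: str):
    """Primary = closest node whose ring id is <= the key's id (wrapping
    around the ring); the shard is the primary plus its successors."""
    positions = [pos for pos, _ in ring]
    idx = bisect_right(positions, ring_position(key)) - 1  # predecessor, may wrap
    return [ring[(idx + i) % len(ring)][1] for i in range(REPLICAS_PER_SHARD)]

print(shard_for("user:1234"))  # e.g. ['node-c', 'node-d', 'node-e']
```

As nodes join or leave, the sorted ring changes and keys map to new shards, which is where the state-transfer step mentioned above comes in.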

Update ordering

A second dimension of consistency concerns the policy for applying updates to replicas. A consistent replication scheme applies the same updates to every replica in the same order and specifies the correct way to initialize new members or nodes recovering from a failure.6-8

Update ordering costs depend on the pattern for issuing updates. In many systems, each data item has a primary copy through which updates are routed, and one or more replicas function as hot-standbys and can support read-only operations. Some systems shift the role of being primary around, but the basic idea is the same: in both cases, delivering updates in the order they were sent, without gaps, keeps the system consistent across multiple replicas. The resulting problem is very much like FIFO delivery of data with TCP, and is solved in much the same manner.

A more costly multicast ordering property is needed if every replica can initiate concurrent, conflicting updates to the same data items. When concurrent updates are permitted, the multicast mechanism must select an agreed-upon order, then the delivery order ensures that the replicas apply the updates in a consistent order. The numerous ways to build such mechanisms are not tremendously complex, but they do introduce extra protocol steps, which slows down the update protocol. The CAP and BASE literature does not discuss this issue explicitly, but the examples used point to systems that permit concurrent updates. Simply requiring replicated data to have a primary copy can yield a significant cost reduction.
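The FIFO, primary-copy case can be illustrated with a small sketch (not from the article): the primary stamps each update with a consecutive sequence number, and a replica applies updates only in that order, buffering any that arrive early.

```python
class PrimaryCopy:
    """Primary stamps each update with a consecutive sequence number."""
    def __init__(self):
        self.seq = 0
    def make_update(self, key, value):
        self.seq += 1
        return (self.seq, key, value)

class Replica:
    """Applies updates in sequence order; out-of-order arrivals are buffered."""
    def __init__(self):
        self.state = {}
        self.next_seq = 1
        self.buffer = {}
    def receive(self, update):
        seq, key, value = update
        self.buffer[seq] = (key, value)
        # Apply every update for which no gap remains (FIFO, no holes).
        while self.next_seq in self.buffer:
            k, v = self.buffer.pop(self.next_seq)
            self.state[k] = v
            self.next_seq += 1

primary, replica = PrimaryCopy(), Replica()
u1 = primary.make_update("x", 1)
u2 = primary.make_update("x", 2)
replica.receive(u2)   # arrives early: buffered, not applied
replica.receive(u1)   # gap filled: both applied, in order
print(replica.state)  # {'x': 2}
```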

Durability

Yet a third dimension involves durability of updates. Obviously, an update that has been performed is durable if the service will not forget it. But precisely what does it mean to have “performed” an update? Must the durability mechanism retain data across complete shutdowns of a service or shard’s full membership?

In applications where the goal is to replicate a database or file (some form of external storage), durability involves mechanisms such as write-ahead logs: all the replicas push updates to their respective logs, then acknowledge that they are ready to commit the update; in a second phase, the system can apply the updates in the logs to the actual database. Thus, while the Paxos protocol9,10 does not refer to the application per se, most Paxos implementations are designed to work with external applications like databases for which durability entails logging of pending updates. This presumes durable storage that will survive failures. The analogous database mechanism would be a write-ahead log. But first-tier services are required to be stateless. Can they replicate data in a way that offers a meaningful durability property? The obvious answer is to consider in-memory update replication: we could distinguish between a service that might respond to a client before every replica knows of the updates triggered by that client’s request, and a service that delays until after every replica has acknowledged the relevant updates. We call the former solution nondurable: if the service has n members, even a single failure can leave

n – 1 replicas in a state where they will never see the update. We refer to the latter solution as amnesia freedom: the service will not forget the update unless all n members fail. Recall that coordinating with the cloud management platform minimizes abrupt shutdowns of all n replicas. Amnesia freedom is not perfect. If a serious failure forces an entire service or shard to shut down, unless the associated data is backed up on an inner-tier service, state will be lost. But because this is a rare case, such a risk may be quite acceptable. For example, suppose that applications for monitoring and controlling embedded systems such as medical monitoring devices move to cloud-hosted settings. While these applications require consistency and other assurance properties, the role of online monitoring is continuous. Moreover, applications of this sort generally revert to a fail-safe mode when active control is lost. Thus, an inner-tier service might not be needed. Applications that do push updates to an inner service have a choice: they could wait for the update to be acknowledged or they could adopt amnesia freedom. However, in so doing, they accept a window of vulnerability for the (hopefully brief) period after the update is fully replicated in the memory of the first-tier service, but before it reaches the inner tier.



Sometimes, a rapid response might be so important that it is not possible to wait for an inner-tier response; if so, amnesia freedom represents a compromise that greatly reduces the risk of update loss at very low cost. To offer another analogy, while database products generally support true multicopy serializability, costs can be high. For this reason, database mirroring is more often done by asynchronously streaming a log, despite the small risk that a failure could cause updates to be lost. Thus, databases often operate in the same manner as an amnesia-free solution that asynchronously pushes updates to an inner-tier service.
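The distinction between the nondurable and amnesia-free reply rules can be captured in a toy model. This is our own sketch, not the article's implementation: updates are multicast asynchronously to a shard's in-memory replicas, and the amnesia-free variant replies only after a barrier confirms that every replica holds them.

```python
from collections import deque

class Shard:
    """Toy model contrasting reply-before-ack with the amnesia-free rule:
    reply only after all n in-memory replicas hold the update."""
    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.in_flight = deque()   # updates sent but not yet delivered

    def send(self, key, value):
        # Asynchronous multicast: queue one pending delivery per replica.
        for replica in self.replicas:
            self.in_flight.append((replica, key, value))

    def flush(self):
        # Barrier: here, simply drain the queue so that every prior
        # update has reached every replica before we return.
        while self.in_flight:
            replica, key, value = self.in_flight.popleft()
            replica[key] = value

    def update_nondurable(self, key, value):
        self.send(key, value)
        return "ok"               # reply while deliveries may still be pending

    def update_amnesia_free(self, key, value):
        self.send(key, value)
        self.flush()              # wait for stability before replying
        return "ok"

shard = Shard(n_replicas=3)
shard.update_amnesia_free("patient:7", "monitoring")
print(all("patient:7" in r for r in shard.replicas))  # True
```

With the amnesia-free rule, the update can only be lost if the entire shard fails before it reaches an inner tier, which matches the window of vulnerability discussed above.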

Failure mode

Our work assumes that applications fail by crashing and that network packets can be lost. Within a single datacenter, if a partition occurs, the affected machines are treated as if they had crashed: when the partition is repaired, those nodes will be forced to restart.

Putting it all together

From this complex set of choices and options, it is possible to construct diverse replication solutions with very different properties, required structure, and expected performance. While some make little sense in the first tier, others represent reasonable options, including the following:

• Build protocols that replicate data optimistically and later heal any problems that arise, perhaps using gossip (BASE). Updates are applied in the first tier, but then passed to inner-tier services that might perform them in different orders.

• Build protocols synchronized with respect to membership changes, and with a variety of ordering and durability properties—virtual synchrony and also “in-memory” versions of Paxos, where the Paxos durability guarantee applies only to in-memory data. Simply enforcing a barrier can achieve amnesia freedom: the system pauses if needed, delaying the response until any updates initiated by the request (or seen by the request through its reads) have reached all the replicas and thus become stable.

• Implement the state machine replication model, including strong durability—the guarantee that even if all service members crash, the current state will be recoverable. Most Paxos implementations use this model. However, in our target scenario, strong durability is not meaningful, and the first phase is limited to logging messages in the memory of the replicas themselves.

• Implement database transactions in the first tier, coupling them to the serialization order using inner tiers, for example, via a true multicopy model based on the ACID model or a snapshot-isolation model. The research community is currently investigating this approach.

How does CAP deal with these diverse options? The question is easier to pose than to answer. When Eric Brewer proposed CAP as a general statement,1 he offered the classic partitioning failure scenario as an illustration of a more broadly applicable principle. His references to consistency evoked ACID database properties. Seth Gilbert and Nancy A. Lynch offered their proof in settings with partitionable wide-area links.2 They pointed out that with even slight changes, CAP ceases to apply, and proposed a t-eventual consistency model that avoids the CAP tradeoff. In their work on BASE, Dan Pritchett 3 and Werner Vogels4 pointed to both the ACID model and the durable form of Paxos.9,10 They argued that these models are too slow for use in the first tier, expressing concerns that apparently stem from the costly two-phase structure of these particular protocols, and their use of quorum reads and updates, resulting in nonlocal responsiveness delays on the critical path that computes responses. The concerns about performance and scalability relate to durability mechanisms, not order-based consistency. Because sharded data predominates in the first tier, it is possible to require each shard to distinguish a primary copy where updates are performed first. Other replicas mirror the primary, supporting read-accesses, and taking over as primary only in the event of a reconfiguration. For a DHT-like structure, it is possible to use chain replication;11 in a process group, the system would load-balance the reads while routing updates through the primary, which would issue the needed multicasts. This reduces the cost of consistency to the trivial requirement of performing updates in FIFO order. Of course, this also requires synchronizing group membership changes relative to updates, but there is no reason that this should impact steady-state performance. We favor in-memory logging of updates, leading to amnesia freedom. Fusing order-based consistency with amnesia freedom results in a strongly consistent model with a slightly weaker notion of durability. Indeed, when acting on a request from an external user, any replica can perform everything locally with the appropriate data, up to the very last step: before responding to the external client, any updates triggered by a request must be stable— that is, they must have reached all the relevant replicas.
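Chain replication, cited above for DHT-like structures, can be pictured with a generic sketch. This is an illustration of the technique under an assumed fixed three-node chain, not code from the article: updates enter at the head and propagate toward the tail, and reads served by the tail only see fully replicated values.

```python
class ChainNode:
    """One replica in a chain; forwards writes to its successor, if any."""
    def __init__(self, name, successor=None):
        self.name, self.successor, self.store = name, successor, {}

    def write(self, key, value):
        self.store[key] = value
        if self.successor is not None:
            self.successor.write(key, value)   # propagate toward the tail

# Build a head -> middle -> tail chain (an illustrative fixed configuration).
tail = ChainNode("tail")
middle = ChainNode("middle", successor=tail)
head = ChainNode("head", successor=middle)

head.write("k", "v")          # updates always enter at the head
print(tail.store.get("k"))    # reads go to the tail, so 'v' is visible only
                              # once every replica in the chain stores it
```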


Figure 1. Communication between three members of an Isis2 process group for various primitives: (a) Send, followed by a Flush barrier, (b) SafeSend (in-memory Paxos), and (c) durable (disk-logged) Paxos. A:1 and A:2 are two updates sent from server A.

The degree of durability will depend on how this delaying step is implemented. A service unwilling to tolerate any delay might respond while updates are still queued in the communication layer. Its responses to the client would reflect the orderly application of updates, but a crash could cause some updates to be forgotten. With amnesia freedom, other replicas back up an update: only the crash of the entire shard or service could result in their loss. Given that the cloud computing community has ruled out simply waiting for a durable inner-tier service to acknowledge the updates, this offers an appealing compromise: a surprisingly scalable, inexpensive, and resilient new option for first-tier services that need consistency.

THE ISIS2 SYSTEM

Our Isis2 system supports virtually synchronous process groups,6 and includes reliable multicasts with various ordering options (available at www.cs.cornell.edu/ken/isis2). The Send primitive is FIFO-ordered. An OrderedSend primitive guarantees total order; we will not be using it here because we assume that sharded data has a primary copy. Amnesia freedom is achieved by invoking a barrier primitive called Flush that delays until any prior unstable multicasts have reached all destinations. This kind of Flush sends no extra messages; it just waits for acknowledgments (Figure 1a).

Isis2 also offers a virtually synchronous version of Paxos,7 via a primitive we call SafeSend. The user can specify the size of the acceptor set; we favor the use of three acceptors, but it is possible to select all members in a process group to serve as acceptors, or half of those members, or whatever is most appropriate to a given application. SafeSend offers two forms of durability: in-memory durability, which we use for soft-state replication in the first tier, and true on-disk durability. Here, we evaluate only the in-memory configuration, since the stronger form of durability is not useful in the first tier.

Figure 1 illustrates these protocol options. Figure 1a shows an application that issues a series of Send operations and then invokes Flush, which causes a delay until all the prior Sends have been acknowledged. In this particular run, updates A:1 and A:2 arrive out of FIFO order at member C, which delays A:2 until A:1 has been received; we illustrate this case to emphasize that implementing FIFO ordering is very inexpensive. Figure 1b shows our in-memory Paxos protocol with two acceptors (nodes A and B) and one additional member (C); all three are learners. Figure 1c shows this same case, but with the durable version of the Paxos protocol, emphasizing that in a durable mode, the protocol must store pending requests on disk rather than in memory.

We ran our experiment in a datacenter with 48 machines, each with 12 Xeon X5650 cores (running at 2.67 GHz) and connected by Gigabit Ethernet. We designed a client to trigger bursts of work during which five multicasts are initiated—for example, to update five replicated data objects. Two cases are considered: for one, the multicasts use Send followed by a Flush; the other uses five calls to SafeSend, in its in-memory configuration with no Flush. Figure 2a shows the latency between when processing started at the leader and when update delivery occurred, on a per-update basis. All three of these consistent-replication options scale far better than might be expected on the basis of the reports in the literature, but the amnesia-free Send significantly outperforms SafeSend even when configured with just three acceptors. The delay associated with Flush was dwarfed by other sources of latency variance, and we did not measure it separately. Notice that if we were sharding data by replicating it to just three to five members, all of these options would offer good performance.
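The two experimental configurations reduce to a simple call pattern, sketched below against a mocked interface. The `send`, `flush`, and `safe_send` names are stand-ins for the primitives described in the text, not the actual Isis2 API (which is a C# library); only the shape of the calls is intended to be accurate.

```python
# Mocked stand-ins for the group primitives discussed in the text; the real
# Isis2 API differs from these illustrative signatures.
class MockGroup:
    def __init__(self):
        self.unacked = []
    def send(self, update):          # FIFO multicast; returns immediately
        self.unacked.append(update)
    def flush(self):                 # barrier: wait for acks to prior sends
        self.unacked.clear()
    def safe_send(self, update):     # in-memory Paxos: ordered, acked inline
        pass

def handle_burst_with_send(group, updates):
    """Case 1 of the experiment: amnesia-free Sends followed by one Flush."""
    for u in updates:
        group.send(u)
    group.flush()                    # a single barrier covers all five updates

def handle_burst_with_safesend(group, updates):
    """Case 2: five SafeSend calls; each runs the multi-phase protocol."""
    for u in updates:
        group.safe_send(u)

group = MockGroup()
handle_burst_with_send(group, [f"update-{i}" for i in range(5)])
handle_burst_with_safesend(group, [f"update-{i}" for i in range(5)])
```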


Figure 2. Latency between the start of processing and update delivery. (a) Mean delivery latency for the Send primitive, contrasted with SafeSend with three acceptors and SafeSend with acceptors equal to the number of members in the group. (b) Histogram showing the probability of delivery latencies relative to mean update latency, for all three protocols, for a group size of 192 members. (c) Cumulative histogram of Flush latency for various numbers of members in the group.


In Figure 2a we omitted error bars, and instead plotted variance from the mean in Figure 2b. This figure focuses on the case of 192 members; for each protocol we took the mean as reported on Figure 2a and then binned individual latencies for the 192 receivers as deltas from the mean, which can thus be positive or negative. While Send latencies are sharply peaked around the mean, the protocol does have a tail extending to as much as 100 ms but impacting only a small percentage of multicasts. For SafeSend the latency deviation is both larger and more common. These observations reflect packet loss: in loss-prone environments—for example, cloud-computing networks, which are notoriously prone to overload and packet loss—each protocol stage can drop messages, which must then be retransmitted. The number of stages explains the degree of spread: SafeSend latencies spread broadly, reflecting instances that incur zero, one, two, or even three packet drops (Figure 1b). In contrast, Send has just a single phase (Figure 1a), hence is at risk of at most loss-driven delay. Moreover, Send generates less traffic, resulting in a lower loss rate at the receivers. In the future, we plan to report on SafeSend performance with disk logging.


ANALYSES OF CAP TRADEOFFS

In addition to Eric Brewer’s original work1 and reports by others who advocate for BASE,3,4 the relevant analyses of CAP and the possible tradeoffs (CA/CP/AP) include Jim Gray’s classic analysis of ACID scalability,12 a study by Hiroshi Wada and colleagues of NoSQL consistency options and costs, and other discussions of this topic.13-17 Database research that relaxes consistency to improve scalability includes the Escrow transaction model,18 PNUTS,19 and Sagas.20 At the other end of the spectrum, notable cloud services that scale well and yet offer strong consistency include the Google File System,21 Bigtable,22 and Zookeeper.23 Research focused on Paxos performance includes the Ring-Paxos protocol24,25 and the Gaios storage system.26 Our work employs a model that unifies Paxos (state-machine replication) with virtual synchrony.7 In addition to our prior work on amnesia freedom,6 other mechanisms that we have exploited in Isis2 include the IPMC allocation scheme from Dr. Multicast,27 and the tree-structured acknowledgments used in QuickSilver Scalable Multicast.2,28

The CAP theorem centers on concerns that the ACID database model and the standard durable form of Paxos introduce unavoidable delays. We have suggested that these delays are actually associated with durability, which is not a meaningful goal in the cloud’s first tier, where applications are limited to soft state. Nonetheless, an in-memory form of durability is feasible. Leveraging this, we can offer a spectrum of consistency options, ranging from none to amnesia freedom to strong f-durability (an update will not be lost unless more than f failures occur). It is possible to offer order-based consistency (state machine replication), and yet achieve high levels of scalable performance and fault tolerance. Although the term amnesia freedom is new, our basic point is made in many comparisons of virtual synchrony with Paxos.

A concern is that cloud developers, unaware that scalable consistency is feasible, might weaken consistency in applications that actually need strong guarantees. Obviously, not all applications need the strongest forms of consistency, and perhaps this is the real insight. Today’s cloud systems are inconsistent by design because this design point has been relatively easy to implement, scales easily, and works well for the applications that earn the most revenue in today’s cloud. The kinds of applications that need stronger assurance properties simply have not yet wielded enough market power to shift the balance. The good news, however, is that if cloud vendors ever tackle high-assurance cloud computing, CAP will not represent a fundamental barrier to progress.

Acknowledgments

We are grateful to Robbert van Renesse, Dahlia Malkhi, the students in Cornell’s CS7412 class (spring 2011), and to the DARPA MRC program, which funds our efforts.

References

1. E. Brewer, “Towards Robust Distributed Systems,” Proc. 19th Ann. ACM Symp. Principles of Distributed Computing (PODC 00), ACM, 2000, pp. 7-10. 2. S. Gilbert and N. Lynch, “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services,” ACM SIGACT News, June 2002, pp. 51-59. 3. D. Pritchett, “BASE: An Acid Alternative,” ACM Queue, May/June 2008, pp. 48-55. 4. W. Vogels, “Eventually Consistent,” ACM Queue, Oct. 2008, pp. 14-19. 5. G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-Value Store,” Proc. 21st ACM SIGOPS Symp. Operating Systems Principles (SOSP 07), ACM, 2007, pp. 205-220. 6. K. Birman, “History of the Virtual Synchrony Replication Model,” Replication: Theory and Practice, LNCS 5959, Springer, 2010, pp. 91-120. 7. K.P. Birman, D. Malkhi, and R. van Renesse, Virtually Synchronous Methodology for Dynamic Service Replication, tech. report MSR-2010-151, Microsoft Research, 2010. 8. K. Birman and T. Joseph, “Exploiting Virtual Synchrony in Distributed Systems,” Proc. 11th ACM Symp. Operating Systems Principles (SOSP 87), ACM, 1987, pp. 123-138. 9. L. Lamport, “Paxos Made Simple,” ACM SIGACT News, Dec. 2008, pp. 51-58. 10. L. Lamport, “The Part-Time Parliament,” ACM Trans. Computer Systems, May 1998, pp. 133-169. 11. R. van Renesse and F.B. Schneider, “Chain Replication for Supporting High Throughput and Availability,” Proc. 6th Symp. Operating Systems Design & Implementation (OSDI 04), Usenix, 2004, pp. 7-7.

12. J. Gray et al., “The Dangers of Replication and a Solution,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD 96), ACM, 1996, pp. 173-182. 13. H. Wada et al., “Data Consistency Properties and the Trade-offs in Commercial Cloud Storage: The Consumers’ Perspective,” Proc. 5th Biennial Conf. Innovative Data Systems Research (CIDR 11), ACM, 2011, pp. 134-143. 14. D. Kossman, “What Is New in the Cloud?” keynote presentation, 6th European Conf. Computer Systems (EuroSys 11), 2011; http://eurosys2011.cs.uni-salzburg.at/pdf/eurosys11invited.pdf. 15. T. Kraska et al., “Consistency Rationing in the Cloud: Pay Only When It Matters,” Proc. VLDB Endowment (VLDB 09), ACM, 2009, pp. 253-264. 16. M. Brantner et al., “Building a Database on S3,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD 08), ACM, 2008, pp. 251-264. 17. D. Abadi, “Problems with CAP, and Yahoo’s Little Known NoSQL System,” blog; http://dbmsmusings.blogspot. com/2010/04/problems-with-cap-and-yahoos-little.html. 18. P.E. O’Neil, “The Escrow Transactional Method,” ACM Trans. Database Systems, Dec. 1986, pp. 405-430. 19. B.F. Cooper et al., “PNUTS: Yahoo!’s Hosted Data Serving Platform,” Proc. VLDB Endowment (VLDB 08), ACM, 2008, pp. 1277-1288. 20. H. Garcia-Molina and K. Salem, “Sagas,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD 87), ACM, 1987, pp. 249-259. 21. S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” Proc. 19th ACM Symp. Operating Systems Principles (SOSP 03), ACM, 2003, pp. 29-43. 22. F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” Proc. 7th Usenix Symp. Operating Systems Design and Implementation (OSDI 06), Usenix, 2006, pp. 205-218. 23. F.P. Junqueira and B.C. Reed, “The Life and Times of a Zookeeper,” Proc. 21st Ann. Symp. Parallelism in Algorithms and Architectures (SPAA 09), ACM, 2009, pp. 4-4. 24. P. Marandi, M. Primi, and F. Pedone, “High-Performance State-Machine Replication,” Proc. IEEE/IFIP Int’l Conf. Dependable Systems and Networks (DSN 11), IEEE CS, 2011, pp. 454-465. 25. P.J. Marandi et al., “Ring-Paxos: A High-Throughput Atomic Broadcast Protocol,” Proc. IEEE/IFIP Int’l Conf. Dependable Systems and Networks (DSN 10), IEEE CS, 2010, pp. 527-536. 26. W.J. Bolosky et al., “Paxos Replicated State Machines as the Basis of a High-Performance Data Store,” Proc. 8th Usenix Symp. Networked Systems Design and Implementation (NSDI 11), Usenix, 2011, pp. 141-154. 27. Y. Vigfusson et al., “Dr. Multicast: Rx for Data Center Communication Scalability,” Proc. 5th European Conf. Computer Systems (EuroSys 10), ACM, 2010, pp. 349-362. 28. K. Ostrowski, K. Birman, and D. Dolev, “QuickSilver Scalable Multicast (QSM),” Proc. 7th IEEE Ann. Int’l Symp. Network Computing and Applications (NCA 08), IEEE, 2008, pp. 9-18.

Kenneth P. Birman is the N. Rama Rao Professor of Computer Science at Cornell University. His research focuses on high assurance for cloud computing and other scalable distributed systems. Birman received a PhD in computer science from the University of California, Berkeley. He received the 2009 IEEE Tsutomu Kanai Award for his work in distributed systems and is an ACM Fellow. Contact him at ken@cs.cornell.edu.

Daniel A. Freedman is a postdoctoral research associate in the Department of Computer Science at Cornell University. His research interests span distributed systems and communication networks. Freedman received a PhD in theoretical physics from Cornell University. He is a member of IEEE, ACM SIGOPS and SIGCOMM, and Usenix. Contact him at dfreedman@cs.cornell.edu.

Qi Huang is a computer science PhD student at Cornell University. His research interests include distributed computing, networking, and operating systems. Huang received a BE in computer science from Huazhong University of Science and Technology, China. He is a student member of ACM. Contact him at qhuang@cs.cornell.edu. Patrick Dowell is a computer science ME student at Cornell University. His research interests include distributed systems, cloud computing, and security. Dowell received a BS in computer science from Cornell University. Contact him at pkd3@cs.cornell.edu.


GUEST EDITOR’S INTRODUCTION

The CAP Theorem’s Growing Impact Simon S.Y. Shim, San Jose State University

The computing community has been developing innovative solutions to meet the formidable challenge of handling the exponential growth in data generated by the Web.

For a long time, commercial relational database management systems with ACID (atomicity, consistency, isolation, and durability) properties from vendors such as Oracle, IBM, Sybase, and Microsoft have been the default home for computational data. However, with the phenomenal growth of Web-generated data—which Vint Cerf referred to as an “information avalanche”—this conventional way of storing data has encountered a formidable challenge. Because the traditional way of handling petabytes of data with a relational database in the back end does not scale well, managing this phenomenon referred to as the big data challenge has become problematic. Highly scalable database solutions are needed to meet the demand of handling this data explosion. With the abundance of inexpensive commodity servers, public and private clouds can store and process big data effectively by scaling horizontally into distributed systems.

To manage the exponentially growing data traffic, Internet companies such as Google, Amazon, Yahoo, Facebook, and Twitter have developed alternative solutions that store data in what have come to be known as NoSQL databases. Carlo Strozzi coined this term in 1990 to refer to databases that do not use SQL as their query language, and Eric Evans later used it to refer to nonrelational and distributed databases. Internet companies contributed their NoSQL databases as open source projects, which later became popular for storing large data. In general, NoSQL databases support flexible schema, scale horizontally, and, more interestingly, do not support ACID properties. They store and replicate data in distributed systems, often across datacenters, to achieve scalability and reliability.

To tolerate network partitions and limit write latency, these systems relax the consistency requirement so that data updates are performed asynchronously, while potential data conflicts can be resolved at data reads. Hence, the system might return inconsistent values from distributed data stores depending on where it read the data. Readers must resolve the potential data inconsistency. With the advances in datacenter network infrastructures, network failures are rare, and this tradeoff between network partitions and data consistency is less relevant within a single datacenter. However, this remains a significant challenge for cloud providers who must maintain multiple datacenters in geographically separated regions. The computing community has been developing innovative solutions to address this issue.

IN THIS ISSUE

In late 2000, Eric Brewer gave a talk on his famous CAP conjecture at the Principles of Distributed Computing Conference.

According to the CAP theorem, it is only possible to simultaneously provide any two of the three following properties in distributed Web-based applications: consistency (C), availability (A), and partition tolerance (P). Later, Seth Gilbert and Nancy A. Lynch proved the conjecture under certain circumstances. In “CAP Twelve Years Later: How the ‘Rules’ Have Changed,” Eric Brewer explains that when it is necessary to choose between C and A during network partitioning, the designer often chooses A. In “Perspectives on the CAP Theorem,” Seth Gilbert and Nancy A. Lynch review the theorem and discuss its practical implications.

Cloud providers have broadened the interpretation of the CAP theorem in the sense that they consider a system as not available if the response time exceeds the latency limit. Thus, a slow network link is considered partitioned. To obtain extreme scalability and performance without node-to-node coordination or synchronization, developers relax consistency to make the system “available” even when there is no real network failure. As Werner Vogels, chief technology officer and vice president of Amazon, puts it, “We do not want to relax consistency. Reality, however, forces us to.”

But many applications need consistency. In “Consistency Tradeoffs in Modern Distributed Database System Design,” Daniel J. Abadi clarifies that a distributed system often cannot support synchronous replication because many applications require low latency. Thus, consistency is sacrificed even when there is no network partition. To better understand the tradeoffs of the CAP theorem, Abadi suggests rewriting CAP, making the latency versus consistency tradeoff explicit.

Building a highly available system by choosing availability over consistency is bound to increase the complexity of distributed systems. Data inconsistency imposes a dimension of design complexity in application development. Programmers need to know when to use fast/inconsistent accesses versus slow/consistent accesses to secure both high performance and correctness. Moreover, they might need to define the conflict resolution rules that meet the application’s needs.

INTERESTED IN LEARNING MORE ABOUT CAP?

In “NoSQL and Mongo DB with Dwight Merriman,” Software Engineering Radio features an interview with Merriman about the emerging NoSQL movement, the three types of nonrelational data stores, Brewer’s CAP theorem, the weaker consistency guarantees that can be made in a distributed database, document-oriented data stores, the data storage needs of modern Web applications, and the open source MongoDB: www.se-radio.net/2010/07/episode-165-nosql-and-mongodb-with-dwight-merriman.


Developers have studied a few weak consistency models, which can be understood through the tradeoff between design complexity and availability—and thus, performance. Eventual consistency chooses availability, thus allowing updates to any closest replica. The system eventually propagates updates over time. However, this eventual consistency might not be able to guarantee the same order of updates. Application-specific rules are needed to resolve data conflicts. An example of this is Amazon’s shopping cart built using Dynamo, Amazon’s eventually consistent keyvalue store. This eventual consistency concept was first used in the early 1990s in the Coda, Bayou, and Ficus distributed systems. The developers of Yahoo’s PNUTS chose to provide stronger consistency at the expense of availability. PNUTS provides timeline consistency on a per-tuple basis, which guarantees that each tuple will undergo the same sequence of transformations at each storage replica. PNUTS employs a per-tuple master node that orders all updates made to the tuple, after which the system disseminates them asynchronously to the storage replicas. Conflict resolution becomes simpler, as storage replicas now see updates in the same order. However, coordinating all updates through a master may have obvious performance and availability implications. PNUTS alleviates these issues by automatically migrating the master to be close to the writers. As Raghu Ramakrishnan points out in “CAP and Cloud Data Management,” this makes the practical impact on performance and availability insignificant for Yahoo’s applications because of localized user access patterns. Finally, in “Overcoming CAP with Consistent Soft-State Replication,” Kenneth P. Birman and his coauthors advocate for even stronger consistency inside the datacenter, where partitions are rare. They show that in this setting, it is possible to achieve low latency and scalability without sacrificing consistency.

With data generation guaranteed to grow rather than shrink, the articles included in this special issue demonstrate that, 12 years after Eric Brewer first proposed it, the CAP theorem continues to play a crucial role in understanding and optimizing distributed systems.

Simon S.Y. Shim is a professor in the Computer Engineering Department at San Jose State University. Contact him at simon.shim@sjsu.edu.



Deconstructing Paxos∗†

Romain Boichat+, Partha Dutta+, Svend Frølund∗, Rachid Guerraoui+

+ Swiss Federal Institute of Technology, CH-1015 Lausanne
∗ Hewlett-Packard Laboratories, 1501 Page Mill Rd, Palo Alto

Abstract

The Paxos part-time parliament protocol of Lamport provides a very practical way to implement a fault-tolerant deterministic service by replicating it over a distributed message passing system. The contribution of this paper is a faithful deconstruction of Paxos that preserves its efficiency in terms of forced logs, messages and communication steps. The key to our faithful deconstruction is the factorisation of the fundamental algorithmic principles of Paxos within two abstractions: weak leader election and round-based consensus, itself based on a round-based register abstraction. Using those abstractions, we show how to reconstruct, in a modular manner, known and new variants of Paxos. In particular, we show how to (1) alleviate the need for forced logs if some processes remain up for sufficiently long, (2) augment the resilience of the algorithm against unstable processes, (3) enable single process decision with shared commodity disks, and (4) reduce the number of communication steps during stable periods of the system.

Keywords: Distributed systems, fault-tolerance, replication, Paxos, modularisation, abstraction.

Contact author: Romain Boichat. E-mail: Romain.Boichat@epfl.ch, Tel (Fax): +41 21 693 6702 (7570).

∗ The Island of Paxos used to host a great civilisation, which was unfortunately destroyed by a foreign invasion. A famous archaeologist reported
on interesting parts of the history of Paxons and particularly described their sophisticated part-time parliament [11]. Paxos legislators maintained consistent copies of the parliamentary records, despite their frequent forays from the chamber and the forgetfulness of their messengers. Although recent studies explored the use of powerful tools to reason about the correctness of the parliament protocol [12, 16], our desire to better understand the Paxon civilisation motivated us to revisit the Island and spend some time deciphering the ancient manuscripts of the legislative system. We discovered that Paxons had precisely codified various aspects of their parliament protocol which enabled them easily adapt the protocol to specific functioning modes throughout the seasons. In particular, during winter, the parliament was heated and some legislators did never leave the chamber: their guaranteed presence helped alleviate the need for expensive writing of decrees on ledgers. This was easy to obtain precisely because the subprotocol used to “store and lock” decrees was precisely codified. In spring, and with the blooming days coming, some legislators could not stop leaving and entering the parliament and their indiscipline prevented progress in the protocol. However, because the election subprotocol used to choose the parliament president was factored out and precisely codified, the protocol could easily be adapted to cope with indisciplined legislators. During summer, very few legislators were in the parliament and it was hardly possible to pass any decree because of the lack of the necessary majority. Fortunately, it was easy to modify the subprotocol used to store and lock decrees and devise a powerful technique where a single legislator could pass decrees by directly accessing the ledgers of other legislators. Fall was a protest period and citizens wanted a faster procedure to pass decrees. Paxons noticed that, in most periods, messengers did not loose messages and legislators replied in time. They could devise a variant of the protocol that reduced the number of communication steps needed to pass decrees during those periods. This powerful optimisation was obtained through a simple refinement of the subprotocol used to propose new decrees. † This work was partially supported by the Swiss National Fund grant No. 510 207.



1 Introduction The Paxos Algorithm The Paxos part-time parliament algorithm of Lamport [11] provides a very powerful way to implement a highlyavailable deterministic service by replicating it over a system of non-malicious processes communicating through message passing. Replicas follow the state-machine pattern (also called active replication) [19]. Each correct replica computes every request and returns the result to the corresponding client which selects the first returned result. Paxos maintains replica consistency by ensuring total order delivery of requests. It does so even during unstable periods of the system, e.g., even if messages are delayed or lost and processes crash and recover. During stable periods, Paxos rapidly achieves progress. 1 As pointed out in [12, 16] however, Paxos is rather tricky and it is difficult to factor out the abstractions that comprise the algorithm. Deconstructing the algorithm and identifying those abstractions is an appealing objective towards specific reconstructions and practical implementations of it. In [12, 16], Lampson, De Prisco and Lynch focused on the key issue in the Paxos algorithm used to agree on a total order for delivering client requests to the replicas. This agreement aspect, factored out within a consensus abstraction, is deconstructed into a storage and a register part. As pointed out in [12, 16], one can indeed obtain a pedagogically appealing state machine replication algorithm as a straightforward sequence of consensus instances, but faithfully preserving the efficiency of the original Paxos algorithm goes through opening the consensus box and combining some of its underlying algorithmic principles with non-trivial techniques such as log piggy-backing and leasing. The aim of our paper is to describe a faithful deconstruction top to bottom, of the entire Paxos replication algorithm. Our deconstruction is faithful in the sense that it relies on abstractions that do no need to be opened in order to preserve the efficiency of the original Paxos replication scheme.

The Faithful Deconstruction

A key to our faithful deconstruction is the identification of the new notion of round-based consensus, which is, in a sense, finer-grained than consensus. 2 This new abstraction is precisely what allows us to preserve efficiency without sacrificing modularity. Our deconstruction of the overall Paxos state machine replication algorithm is modular, and yet it preserves the efficiency of the original algorithm in terms of forced logs, messages and communication steps. We use round-based consensus in conjunction with a leader election abstraction, both as first class citizens at the level of the replication algorithm. Round-based consensus allows us to expose the notion of round up to the replication scheme, as in the original Paxos replication algorithm (but in a more modular manner) and merge all forced logs of the round at the lowest level of abstraction. Round-based consensus also allows a process to propose more than once (e.g., after a crash and a recovery) without implying a forced log. Having the notion of leader as a first class abstraction at the level of the replication algorithm (and not hidden by a consensus box) enables the client to send its request directly to the leader, which can process several requests in a row.

1 In fact, the liveness of the algorithm relies on partial synchrony assumptions whereas safety does not: Paxos is “indulgent” in the sense of [6]. In a stable period where the leader communicates in a timely manner with a majority of the processes (most frequent periods in practice), two communication steps (four if the client process is not leader) and one forced log at a majority of the processes are enough to perform a request and return a reply.

2 The round-based consensus is actually strictly weaker than consensus: it can be implemented with a majority of correct processes and does not fall within the FLP impossibility, yet it has a meaningful liveness property. Roughly speaking, round-based consensus is the abstraction that we obtain after extracting the leader election from consensus.

Effective Reconstructions

Not only do our abstractions of leader election and round-based consensus help faithfully deconstruct the original Paxos replication algorithm, they also enable us to straightforwardly reconstruct known and new variants of it by only modifying the implementation of one of our abstractions. For example, we show how to easily obtain a modularisation of the so-called Disk Paxos replication algorithm [5], where progress is ensured with a single correct process and a majority of correct disks, by simply modifying a component in round-based consensus (its round-based register).3 We also show how to cleanly obtain the “Fast” Paxos variant by integrating the tricky “lease-based” optimisation, sketched in [11] and pointed out in [12]. This optimisation makes it possible, in stable periods of the system (where “enough” processes communicate in a timely manner), for any leader to determine the order for a request in a single round-trip communication step. We also construct two new variants of Paxos. The first one is more resilient than the original one in the sense that it copes with unstable processes, i.e., processes that keep on crashing and recovering forever. (The original Paxos replication algorithm might not achieve progress in the presence of such processes.) Our second variant alleviates the need for stable storage and relies instead on some processes being always up. This variant is more efficient than the original one (stable storage is usually considered a major source of overhead) and intuitively reflects the practical assumption that only part of the total system can be down at any point in time, or indirectly, that the system configuration has a “large” number of replicas.4 We point out that further variants can be obtained by mixing the variants we present in the paper, e.g., a Fast Disk Paxos algorithm or a Fast Paxos algorithm that handles unstable processes. Thanks to our modular approach, we could implement Paxos and its variants as a framework. We give here practical performance measurements of the various replication algorithms in this framework.

Roadmap

The rest of the paper is organised as follows. Section 2 describes the model and the problem specification. Section 3 gives the specification of our abstractions. We show how to implement these specifications in a crash-stop model in Section 4, and how to transpose the implementation in a more general crash-recovery model in Section 5. Section 6 describes four interesting variants of the algorithm. Section 7 discusses related work. Appendix A gives some performance measurements of our framework implementation. Appendix B gives an implementation of the failure detector Ω in a crash-recovery model with partial asynchrony assumptions.

3 This typically makes sense if we have shared hard disks (some parallel database systems use this approach for fail-over when they mount each other's disks) or if we have some notion of network-attached storage.

4 Note that such a configuration does not preclude the possibility of process crash-recovery. There is here a trade-off that reflects the real-world setting: fewer processes + forced logs vs. more processes without forced logs.



2 Model

2.1 Processes

We consider a set of processes Π = {p1, p2, ..., pn}. At any given time, a process is either up or down. When it is up, a process progresses at its own speed, behaving according to its specification (i.e., it correctly executes its program). Note that we do not make here any assumption on the relative speed of processes. While being up, a process can fail by crashing; it then stops executing its program and becomes down. A process that is down can later recover; it then becomes up again and restarts by executing a recovery procedure. The occurrence of a crash (resp. recovery) event makes a process transit from up to down (resp. from down to up). A process pi is unstable if it crashes and recovers infinitely many times. We define an always-up process as a process that never crashes. We say that a process pi is correct if there is a time after which the process is permanently up.5 A process is faulty if it is not correct, i.e., either eventually always-down or unstable. A process is equipped with two local memories: a volatile memory and a stable storage. The primitives store and retrieve allow a process that is up to access its stable storage. When it crashes, a process loses the content of its volatile memory; the content of its stable storage is however not affected by the crash and can be retrieved by the process upon recovery.
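To make the store and retrieve primitives concrete, the following is a minimal Java sketch of the volatile/stable split assumed by the model; it is our own illustration, not part of the paper, and the file-based log and its key/value format are arbitrary choices.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

// Minimal sketch of the volatile/stable memory split assumed by the model.
// Volatile fields of a process are lost on a crash; whatever was store()d can be retrieve()d on recovery.
public class StableStorage {
    private final File log;   // stable storage backing file (an arbitrary, illustrative choice)

    public StableStorage(String path) {
        this.log = new File(path);
    }

    // store: force a key/value pair to stable storage (a "forced log")
    public synchronized void store(String key, String value) throws IOException {
        Properties state = loadAll();
        state.setProperty(key, value);
        try (FileOutputStream out = new FileOutputStream(log)) {
            state.store(out, "stable storage");
            out.getFD().sync();   // do not acknowledge anything before the data actually hits the disk
        }
    }

    // retrieve: read back the last stored value (or null), typically from the recovery procedure
    public synchronized String retrieve(String key) throws IOException {
        return loadAll().getProperty(key);
    }

    private Properties loadAll() throws IOException {
        Properties state = new Properties();
        if (log.exists()) {
            try (FileInputStream in = new FileInputStream(log)) {
                state.load(in);
            }
        }
        return state;
    }
}

Any state that must survive a crash, such as the forced logs of Section 5, would go through store() before the corresponding acknowledgement is sent.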

2.2 Link Properties

Processes exchange information and synchronise by sending and receiving messages through channels. We assume the existence of a bidirectional channel between every pair of processes. We assume that every message m includes the following fields: the identity of its sender, denoted sender(m), and a local identification number, denoted id(m). These fields make every message unique throughout the whole life of the process, i.e., a message cannot have the same id even after the crash and recovery of a process. Channels can lose or drop messages and there is no upper bound on message transmission delays. We assume channels that ensure the following properties between every pair of processes pi and pj:

No creation: If pj receives a message m from pi at time t, then pi sent m to pj before time t.

Fair loss: If pi sends a message m to pj an infinite number of times and pj is correct, then pj receives m from pi an infinite number of times.

These properties characterise the links between processes and are independent of the process failure pattern occurring in the execution. The last property is sometimes called weak loss, e.g., in [14]. It reflects the usefulness of the communication channel: without the weak loss property, any interesting distributed problem would be trivially impossible to solve. By introducing the notion of correct process into the fair loss property, we define the conditions under which a message is delivered to its recipient process. Indeed, the delivery of a message requires the recipient process to be running at the time the channel attempts to deliver it, and therefore depends on the failure pattern occurring in the execution. The fair loss property indicates that a message can be lost, either because the channel may not attempt to deliver the message or because the recipient process may be down when the channel attempts to deliver the message to it. In both cases, the channel is said to commit an omission failure. We assume the presence of a discrete global clock whose range of ticks τ is the set of natural numbers. This clock is used to simplify presentation and not to introduce time synchrony, since processes cannot access the global clock. We will indeed introduce some partial synchrony assumptions (otherwise, fault-tolerant agreement and total order are impossible [4]), but these assumptions will be encapsulated inside our weak leader election abstraction and used only to ensure progress (liveness). We give the implementation (with some details on the partial synchrony model) of the failure detector on which our weak leader election is based in Appendix B. Finally, we say that the system is in a stable period when (i) the weak leader election returns the same process pl at all processes, (ii) a majority of processes remains up, and (iii) no process or link crashes or recovers. Otherwise, we say that the system is in an unstable period.

5 In practice, a process is required to stay up long enough for the computation to terminate. In asynchronous systems however, characterising the notion of “long enough” is impossible.

3 Abstractions: Specifications

Our deconstruction of Paxos is based on two main abstractions: a weak leader election and a round-based consensus, itself based on a round-based register (sub)abstraction. These “shared memory” abstractions export operations that are invoked by the processes implementing the replicated service. As in [10], we say that an operation invocation inv2 follows (is subsequent to) an operation invocation inv1 if inv2 was called after inv1 has returned. Otherwise, the invocations are concurrent. Roughly speaking, Paxos ensures that all processes deliver messages in the same order. The round-based consensus encapsulates the subprotocol used to “agree” on the order; the round-based register encapsulates the subprotocol used (within round-based consensus) to “store” and “lock” the agreement value (i.e., the order); and the weak leader election encapsulates the subprotocol used to eventually choose a unique leader that succeeds in storing and locking a final decision value in the register. We give here the specifications of these abstractions, together with the specification of the problem we solve using these abstractions, i.e., total order delivery. (Implementations are given in the next sections.) The specifications rely on the notion of process correctness: we assume that processes fail only by crashing, and a process is correct if there is a time after which the process is always-up (i.e., not crashed).6

3.1 Round-Based Register

Like a standard register, a round-based register is a shared register that has two operations: read(k) and write(k, v). These operations are invoked by the processes in the system. Unlike a standard register, the operation invocations of a round-based register (1) take as a parameter an integer k (i.e., a round number), and (2) may commit or abort. Note that the notion of round is the same for round-based register and round-based consensus: it corresponds to the notion of ballots in the original Paxos. The commit/abort outcome reflects the success or the failure of the operation. More precisely, the read(k) operation takes as input an integer k. It returns a pair (status, v), where status ∈ {commit, abort} and v ∈ V, the set of possible values for the register; ⊥ ∈ V is the initial value of the register. If read(k) returns (commit, v) (resp. (abort, v)), we say that read(k) commits (resp. aborts) with v. The write(k, v) operation takes as input an integer k and a value v ∈ V. It returns status ∈ {commit, abort}. If write(k, v) returns commit (resp. abort), we say that write(k, v) commits (resp. aborts).7 Intuitively, when a read() invocation aborts, it gives information about what the process itself has done in the past (e.g., before it crashed and recovered), whereas when a write() invocation aborts, it gives to the process information about what other processes are doing. A round-based register satisfies the following properties:

• Read-abort: If read(k) aborts, then some operation read(k′) or write(k′, ∗) was invoked with k′ ≥ k.

• Write-abort: If write(k, ∗) aborts, then some operation read(k′) or write(k′, ∗) was invoked with k′ > k.

• Read-write-commit: If read(k) or write(k, ∗) commits, then no subsequent read(k′) can commit with k′ ≤ k and no subsequent write(k′, ∗) can commit with k′ < k.8

• Read-commit: If read(k) commits with v and v ≠ ⊥, then some operation write(k′, v) was invoked with k′ < k.

• Write-commit: If write(k, v) commits and no subsequent write(k′, v′) is invoked with k′ ≥ k and v′ ≠ v, then any read(k″) that commits, commits with v if k″ > k.

These properties define the conditions under which the operations can abort or commit. Indirectly, these conditions relate the values read and written on the register. We first describe the condition under which an invocation can abort. Roughly speaking, an operation invocation aborts only if there is a conflicting invocation. Like in [11], the notion of “conflict” is defined here in terms of round numbers associated with the operations. Intuitively, a read() that commits returns the value written by a “previous” write(), or the initial value ⊥ if no write() has been invoked. A write() that commits forces a subsequent read() to return the value written, unless this value has been overwritten. The read-abort and write-abort conditions capture the intuition that a read(k) (resp. a write(k, v)) conflicts with any other operation (read(k′) or write(k′, v)) made with k′ ≥ k (resp. k′ > k). The read-write-commit condition expresses the fact that, to commit an operation, a process must use a round number that is higher than any round number of an already committed invocation. The read-commit condition captures the intuition that no value can be read unless it has been “previously” written. If there has not been any such write, then the initial value ⊥ is returned. The write-commit condition captures the intuition that, if a value is (successfully) written, then, unless there is a subsequent write, every subsequent successful read must return that value. Informally, the two conditions (read-commit, write-commit) ensure that the value read is the “last” value written.

To illustrate the behaviour of a round-based register, consider the example of Figure 1. Three processes p1, p2 and p3 access the same round-based register. Process p1 invokes write(1, X) before any process invokes any operation on the register: operation write(1, X) commits and the value of the register is X: p1 gets commit as a return value. Later, p2 invokes read(2) on the register: the operation commits and p2 gets (commit, X) as a return value. If p3 later invokes write(1, Y), then the operation aborts: the return value is abort (because p2 has invoked read(2)). The register value remains X. If p3 later invokes write(3, Y), the operation commits: the new register value is then Y.

6 Note that the validity period of this definition is the duration of a protocol execution, i.e., in practice, a process is correct if it eventually remains up long enough for the protocol to terminate.

7 Note that even if a write() aborts, its value might be subsequently read, i.e., the write() operation is not atomic.

8 Note that we deliberately do not restrict the case where different processes perform invocations with the same round number. Paxos indeed assumes round number uniqueness, as we will see in Section 4.



Figure 1. Round-based register example (p1: write(1, X) commits; p2: read(2) commits with X; p3: write(1, Y) aborts, write(3, Y) commits).
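To fix the interface in code, here is a hypothetical Java rendering of the round-based register API (the names are ours, not the paper's); ⊥ is represented by null.

// Hypothetical Java interface (our naming) for the round-based register abstraction.
// Both operations carry a round number k and may either commit or abort; null stands for ⊥.
public interface RoundBasedRegister<V> {

    enum Status { COMMIT, ABORT }

    // Result of read(k): a status together with the value observed (null if the register holds ⊥).
    final class ReadResult<T> {
        public final Status status;
        public final T value;
        public ReadResult(Status status, T value) {
            this.status = status;
            this.value = value;
        }
    }

    // read(k): returns (commit, v) or (abort, v)
    ReadResult<V> read(int k);

    // write(k, v): returns commit or abort
    Status write(int k, V value);
}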

3.2 Round-Based Consensus

We introduce below our round-based consensus abstraction. This abstraction captures the subprotocol used in Paxos to agree on a total order. Our consensus notion corresponds to a single instance of total order, i.e., one batch of messages. To differentiate between consensus instances, i.e., batches of messages, we index the consensus instances with an integer (L). We represent our consensus notion in the form of a shared object with one operation: propose(k, v) [9]. This operation takes as input an integer k (i.e., a round number, which is the same one used in the round-based register) and an initial value v in a domain V (i.e., a proposition for the consensus). It returns a status in {commit, abort} and a value in V. We say that a process pi proposes a value initi for round k when pi invokes propose(k, initi). We say that pi decides v in round k (or commits round k) when pi returns from the invocation propose(k, initi) with commit and v. If the invocation of propose(k, v) returns abort at pi, we say that pi aborts round k. Round-based consensus has the following properties:

• Validity: If a process decides a value v, then v was proposed by some process.

• Agreement: No two processes decide differently.

• Termination: If a propose(k, ∗) aborts, then some operation propose(k′, ∗) was invoked with k′ ≥ k; if propose(k, ∗) commits, then no operation propose(k′, ∗) can subsequently commit with round k′ ≤ k.

The agreement and validity properties of our round-based consensus abstraction are similar to those of the traditional consensus abstraction [9]. Our termination property is however strictly weaker. If processes keep concurrently proposing values with increasing round numbers, then no process might be able to decide any value. In a sense, our notion of consensus has a conditional termination property. In comparison, the author of [12] presents a consensus that does not ensure any liveness property. As stated by Lampson, the reason for not giving any liveness property is to avoid the applicability of the impossibility result of [4]. Our round-based consensus specification is weaker than consensus and does not fall into the impossibility result of [4], but nevertheless includes a liveness property. In the rest of the paper, when no ambiguity is possible, we shall simply use the term consensus instead of round-based consensus.

In Figure 2, process p2 commits consensus with value Y for round 2. Process p1 then triggers consensus by invoking propose(1, X) but aborts, because process p2 proposed with a higher round number and prevents p1 from committing. Process p1 then proposes with value X for round 4, and this time p1 commits. Process p3 aborts when it proposes with value Z for round 3.



Figure 2. Round-based consensus example (p2: propose(2, Y) commits; p1: propose(1, X) aborts, propose(4, X) commits; p3: propose(3, Z) aborts).

3.3 Weak Leader Election

Intuitively, a weak leader election abstraction is a shared object that elects a leader among a set of processes. It encapsulates the subprotocol used in Paxos to choose a process that decides on the ordering of messages. The weak leader election object has one operation, named leader(), which returns a process identifier, denoting the current leader. When the operation returns pj at time t at process pi, we say that pj is leader for pi at time t (or pi elects pj at time t). We say that a process pi is an eventual perpetual leader if (1) pi is correct, and (2) eventually every invocation of leader() returns pi. Weak leader election satisfies the following property: Some process is an eventual perpetual leader. It is important to notice that the property above does not prevent the case where, for an arbitrary period of time, various processes are simultaneously leaders.9 However, there must be a time after which the processes agree on some unique correct leader. Figure 3 depicts a scenario where every process elects process p1, and then p1 crashes; eventually every process elects process p2.

Figure 3. Weak leader election example (every process first elects p1; after p1 crashes, eventually every process elects p2).

3.4 Total Order Delivery

The main problem solved by the actual Paxos protocol is to ensure total order delivery of messages (i.e., requests broadcast to replicas).10 Total order broadcast is defined by two primitives: TO-Broadcast and TO-Deliver. We say that a process TO-Broadcasts a message m when it invokes TO-Broadcast with m as an input parameter. We say that a process TO-Delivers a message m when it returns from the invocation of TO-Deliver with m as an output parameter. Our total order broadcast protocol has the following properties:

• Termination: If a process pi TO-Broadcasts a message m and then pi does not crash, then pi eventually TO-Delivers m.



• Agreement: If a process TO-Delivers a message m, then every correct process eventually TO-Delivers m.

• Validity: For any message m, (i) every process pi that TO-Delivers m, TO-Delivers m only if m was previously TO-Broadcast by some process, and (ii) every process pi TO-Delivers m at most once.

• Total order: Let pi and pj be any two processes that TO-Deliver some message m. If pi TO-Delivers some message m′ before m, then pj also TO-Delivers m′ before m.

It is important to notice that the total order property we consider here is slightly stronger than the one introduced in [8]. In [8], it is stated that if any processes pi and pj both TO-Deliver messages m and m′, then pi TO-Delivers m before m′ if and only if pj TO-Delivers m before m′. With this property, nothing prevents a process pi from TO-Delivering the sequence of messages m1; m2; m3 whereas another (faulty) process TO-Delivers m1; m3 without ever delivering m2. Our specification clearly excludes that scenario and more faithfully captures the (uniform) guarantee offered by Paxos [11].

9 In this sense, our weak leader election specification is strictly weaker than the notion of leader election introduced in [18].

10 In fact, Paxos also deals with causal order delivery of messages, but we do not consider that issue here.
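As an illustration only, the two primitives can be rendered as a small, hypothetical Java interface (our naming, not the paper's); the registered callback plays the role of TO-Deliver.

import java.util.function.Consumer;

// Hypothetical interface for total order broadcast: toBroadcast submits a message, and the
// registered callback is invoked at most once per message, in the same order at every process.
public interface TotalOrderBroadcast<M> {

    // TO-Broadcast(m)
    void toBroadcast(M message);

    // Register the TO-Deliver upcall of the replicated service
    void onToDeliver(Consumer<M> deliverCallback);
}

A replica would typically register the function that applies requests to its state machine as the deliver callback.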

4 Abstractions: Implementations

In the following, we give wait-free [9] implementations of our three abstractions and show how they can be used to implement a simple variant of the Paxos protocol in the particular case of a crash-stop model (following the architecture of Figure 4). We will show how to step to a crash-recovery model in the next section.

Figure 4. Architecture (modules: Paxos; Round-Based Consensus; Weak Leader Election; Round-Based Register; Communication).

We simply assume here that messages are not lost or duplicated and that processes that crash halt their activities and never recover. We also assume that a majority of the processes never crash and, for the implementation of our weak leader election abstraction, we assume the failure detector Ω introduced in [2].

4.1 Round-Based Register

The algorithm of Figure 5 implements the abstraction of a round-based register. The algorithm works intuitively as follows. Every process pi has a copy of the register value, denoted by vi and initialised to ⊥. A process reads or writes a value by accessing a majority of the copies with a round number. Depending on the round number, a process pi might or might not “accept” the access to its local copy vi. Every process pi has a variable readi that represents the highest round number of a read() “accepted” by pi, and a variable writei that represents the highest round number of a write() “accepted” by pi. The algorithm is made up of two procedures (read() and write()) and two tasks that handle READ and WRITE messages. Each task is executed in one atomic step to avoid mutual exclusion problems for the common variables. We assume here that a task is implemented as a thread in Java™.



1: procedure register()                                              {Constructor, for each process pi}
2:   readi ← 0                                                       {Highest read() round number accepted by pi}
3:   writei ← 0                                                      {Highest write() round number accepted by pi}
4:   vi ← ⊥                                                          {pi's estimate of the register value}
5: procedure read(k)
6:   send [READ, k] to all processes
7:   wait until received [ackREAD, k, ∗, ∗] or [nackREAD, k] from ⌈(n+1)/2⌉ processes
8:   if received at least one [nackREAD, k] then
9:     return(abort, v)                                              {read() is aborted}
10:  else
11:    select the [ackREAD, k, k′, v] with the highest k′
12:    return(commit, v)                                             {read() is committed}
13: procedure write(k, v)
14:   send [WRITE, k, v] to all processes
15:   wait until received [ackWRITE, k] or [nackWRITE, k] from ⌈(n+1)/2⌉ processes
16:   if received at least one [nackWRITE, k] then
17:     return(abort)                                                {write() is aborted}
18:   else
19:     return(commit)                                               {write() is committed}
20: task wait until receive [READ, k] from pj
21:   if writei ≥ k or readi ≥ k then
22:     send [nackREAD, k] to pj
23:   else
24:     readi ← k
25:     send [ackREAD, k, writei, vi] to pj
26: task wait until receive [WRITE, k, v] from pj
27:   if writei > k or readi > k then
28:     send [nackWRITE, k] to pj
29:   else
30:     writei ← k
31:     vi ← v                                                       {A new value is “adopted”}
32:     send [ackWRITE, k] to pj

Figure 5. A wait-free round-based register in a crash-stop model

Lemma 1. Read-abort: If read(k) aborts, then some operation read(k′) or write(k′, ∗) was invoked with k′ ≥ k. Proof. Assume that some process pj invokes a read(k) that returns abort (i.e., aborts). By the algorithm of Figure 5, this can only happen if some process pi has a value readi ≥ k or writei ≥ k, which means that some process has invoked read(k′) or write(k′, ∗) with k′ ≥ k.

Lemma 2. Write-abort: If write(k, ∗) aborts, then some operation read(k′) or write(k′, ∗) was invoked with k′ > k. Proof. Assume that some process pj invokes a write(k) that returns abort (i.e., aborts). By the algorithm of Figure 5, this can only happen if some process pi has a value readi > k or writei > k, which means that some process has invoked read(k′) or write(k′, ∗) with k′ > k.

Lemma 3. Read-write-commit: If read(k) or write(k, ∗) commits, then no subsequent read(k′) can commit with k′ ≤ k and no subsequent write(k′, ∗) can commit with k′ < k. Proof. Let process pi be any process that commits read(k) (resp. write(k, ∗)). This means that a majority of the processes have “accepted” read(k) (resp. write(k, ∗)). For a process pj to commit read(k′) with k′ ≤ k (resp. write(k′, ∗) with k′ < k), a majority of the processes must “accept” read(k′) (resp. write(k′, ∗)). Hence, at least one process must “accept” read(k) (resp. write(k, ∗)) and then read(k′) with k′ ≤ k (resp. write(k′, ∗) with k′ < k), which is impossible by the algorithm of Figure 5: a contradiction.



Lemma 4. Read-commit: If read(k) commits with v and v ≠ ⊥, then some operation write(k′, v) was invoked with k′ < k. Proof. By the algorithm of Figure 5, if some process pj commits read(k) with v ≠ ⊥, then (i) some process pi must have sent to pj a message [ackREAD, k, writei, v] and (ii) some process pm must have invoked write(k′, v) with k′ < k. Otherwise pi would have sent [nackREAD, k] or [ackREAD, k, 0, ⊥] to pj. ✷

Lemma 5. Write-commit: If write(k, v) commits and no subsequent write(k′, v′) is invoked with k′ ≥ k and v′ ≠ v, then any read(k″) that commits, commits with v if k″ > k. Proof. Assume that some process pi commits write(k, v), assume that no subsequent write(k′, v′) has been invoked with k′ ≥ k and v′ ≠ v, and assume that for some k″ > k some process pj commits read(k″) with v″. Assume by contradiction that v ≠ v″. Since read(k″) commits with v″, by the read-commit property, some operation write(k0, v″) was invoked with k0 < k″. However, this is impossible since we assumed that no write(k′, v′) operation with k′ ≥ k and v′ ≠ v has been invoked, i.e., vi remains unchanged at v: a contradiction.

Proposition 6. The algorithm of Figure 5 implements a round-based register.

Proof. Directly from lemmata 1, 2, 3, 4 and 5. ✷

Proposition 7. With a majority of correct processes, the implementation of Figure 5 is wait-free.

Proof. The only wait statements of the protocol are the guard lines that depict the waiting for a majority of replies. These are non-blocking since we assume a majority of correct processes. Indeed, a majority of correct processes always send a message to the requesting process, either of type [ackREAD, nackREAD] or of type [ackWRITE, nackWRITE]. ✷

4.2 Round-Based Consensus

The algorithm of Figure 6 implements a round-based consensus object that relies on a wait-free round-based register. The basic idea of the algorithm is the following. For a process pi to propose a value for a round k, pi first reads the value of the register with k, and if the read(k) operation commits, pi invokes write(k, v) (with pi's initial value instead of v if no value has been written). If the write(k, v) operation commits, then the process decides the value written (i.e., returns this value). Otherwise, pi aborts and returns abort (line 7).

Lemma 8. Validity: If a process decides a value v, then v was proposed by some process. Proof. Let pi be a process that decides some value v. By the algorithm of Figure 6, either (a) v is the value proposed by pi, in which case validity is satisfied, or (b) v has been read by pi in the register. Consider case (b): by the read-commit property of the register, some process pj must have invoked some write() operation. Let pj be the first process that invokes write(k0, ∗), with k0 equal to the smallest k ever invoked for write(k, v). By the algorithm of Figure 6,



1: procedure consensus()                                             {Constructor, for each process pi}
2:   v ← ⊥; reg ← new register()
3: procedure propose(k, initi)
4:   if reg.read(k) = (commit, v) then
5:     if (v = ⊥) then v ← initi
6:     if (reg.write(k, v) = commit) then return(commit, v)
7:   return(abort, initi)

Figure 6. A wait-free round-based consensus using a wait-free round-based register

there are two cases to consider: either (a) v is the value proposed by pj, in which case validity is ensured, or (b) v has been read by pj in the register. For case (b), by the read-commit property of the register, for pj to read v, some process pm must have invoked write(k′, v) with k′ < k0: a contradiction. Therefore, v is the value proposed by pj and validity is ensured. ✷

Lemma 9. Agreement: No two processes decide differently.

Proof. Assume by contradiction that two processes pi and pj decide two different values v and v′. Let pi decide v after committing propose(k, v) and pj decide v′ after committing propose(k′, v′). Assume without loss of generality that k′ ≥ k. By the algorithm of Figure 6, pj must have committed read(k′) before invoking write(k′, v′). By the read-abort property, k′ > k, and by the write-commit property pj commits read(k′) with v and then invokes write(k′, v). Even if write(k′, v) aborts, pj tries to write v and not v′ ≠ v. Therefore, the next time pj commits a write(k″, v″), then v″ = v, i.e., pj decides v: a contradiction.

Lemma 10. Termination: If a propose(k, ∗) aborts, then some operation propose(k′, ∗) was invoked with k′ ≥ k; if propose(k, ∗) commits, then no operation propose(k′, ∗) can subsequently commit with round k′ ≤ k. Proof. For the first part, assume that some operation propose(k, ∗) invoked by pi aborts. By the algorithm of Figure 6, this means that pi aborts read(k) or write(k, ∗). By the read-abort property, some process must have proposed in a round k′ ≥ k. Consider now the second part. Assume that some operation propose(k, ∗) invoked by pi commits. By the algorithm of Figure 6 and the read-write-commit property, no process can subsequently commit any read(k′) with k′ ≤ k. Hence no process can subsequently commit a round k′ ≤ k.

Proposition 11. The algorithm of Figure 6 implements a wait-free round-based consensus. Proof. Termination, agreement and validity follow from lemmata 8, 9 and 10. The implementation of round-based consensus is wait-free since it is based on a wait-free round-based register and does not introduce any “wait” statement. ✷
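As an illustration, here is a minimal Java sketch of the propose() logic of Figure 6, written against the hypothetical register interface sketched in Section 3.1; the class and method names are ours.

// Sketch (our own code) of round-based consensus layered on the register interface of Section 3.1,
// following the structure of Figure 6: read(k); adopt the initial value only if the register
// still holds ⊥ (null); then try to write(k, v).
public final class RoundBasedConsensus<V> {

    public static final class Outcome<T> {
        public final boolean committed;
        public final T value;   // the decided value if committed, the caller's initial value otherwise
        Outcome(boolean committed, T value) {
            this.committed = committed;
            this.value = value;
        }
    }

    private final RoundBasedRegister<V> reg;

    public RoundBasedConsensus(RoundBasedRegister<V> reg) {
        this.reg = reg;
    }

    // propose(k, init): commit with a decision, or abort
    public Outcome<V> propose(int k, V init) {
        RoundBasedRegister.ReadResult<V> r = reg.read(k);
        if (r.status != RoundBasedRegister.Status.COMMIT) {
            return new Outcome<>(false, init);            // propose(k, *) aborts
        }
        V v = (r.value == null) ? init : r.value;         // line 5 of Figure 6: adopt init only if v = ⊥
        if (reg.write(k, v) == RoundBasedRegister.Status.COMMIT) {
            return new Outcome<>(true, v);                // decide v
        }
        return new Outcome<>(false, init);                // propose(k, *) aborts
    }
}

Note how the sketch never invents a value: it either adopts what the register already holds or, if the register still holds ⊥, its own initial value, which is what gives validity.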

4.3 Weak Leader Election

Figure 7 describes a simple implementation of a wait-free weak leader election. The protocol relies on the assumptions (i) that at least one process is correct and (ii) the existence of failure detector Ω [2]: Ω outputs (at each process) a trusted process, i.e., a process that is trusted to be up. Failure detector Ω satisfies the following property: There is a



time after which exactly one correct process pl is always trusted by every correct process.11 Our weak leader election relies on Ω in the following way. The output of failure detector Ω at process pi is denoted by Ωi. The function leader() simply returns the value of Ωi.

1: procedure leader()                                                {For each process pi}
2:   return(Ωi)

Figure 7. A wait-free weak leader election with Ω

Proposition 12. With failure detector Ω and the assumption that at least one process is correct, the algorithm of Figure 7 implements a wait-free weak leader election. Proof. Follows from the property of Ω [2].
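A minimal Java sketch of Figure 7, with a hypothetical Omega interface standing in for the failure detector (both names are ours):

// Hypothetical Ω interface: returns the id of the process currently trusted to be up.
interface Omega {
    int trusted();
}

// Weak leader election in the style of Figure 7: leader() simply forwards the output of Ω.
public final class WeakLeaderElection {

    private final Omega omega;

    public WeakLeaderElection(Omega omega) {
        this.omega = omega;
    }

    // Eventually, every correct process keeps obtaining the same correct process from this call.
    public int leader() {
        return omega.trusted();
    }
}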

4.4 A Simple Variant of Paxos

The algorithm of Figure 9 can be viewed as a simple and modular version of Paxos in a crash-stop model (whereas the original Paxos protocol considers a crash-recovery model; see the next section). The algorithm uses a series of consecutive round-based consensus (or simply consensus) instances, each consensus instance being used to agree on a batch of messages. Every process differentiates consecutive consensus instances by maintaining a local counter (L): each value of the counter corresponds to a specific consensus instance and indexes the propose() operation. Consensus instances are triggered according to the output of the weak leader election protocol: only leaders trigger consensus instances. We give here an intuitive description of the algorithm. When a process pi TO-Broadcasts a message m, pi consults the weak leader election protocol and sends m to the leader pj. When pj receives m, pj triggers a new consensus instance by proposing all messages that it has received (and not yet TO-Delivered) and sets the round number to its process id. Note that in order to decide on a batch of messages, more than one consensus round might be necessary; the various consensus invocations for the same batch (L) are differentiated by the round number k. Due to round number uniqueness, no process can propose twice for the same round k.12 In fact, pj starts a new task propose (the Lth) that keeps on trying to commit consensus for this batch (L), as long as pj remains leader. If consensus commits, pj sends the decision to every process. Otherwise, task propose periodically invokes consensus with the same batch of messages but increases its round number by n, unless pj stops being leader or some consensus instance for the same batch commits. When pi elects another process pk, pi sends to pk every message that pi has received and not yet TO-Delivered. By the weak leader election property, eventually every correct process elects the eventual perpetual leader pl, and sends its messages to pl. By the round-based consensus specification, eventually pl commits consensus and sends the decision to every process. Once pi receives a decision for the Lth batch of messages, pi stops task propose for this batch. Process pi TO-Delivers this batch of messages only if it is the next one that was expected, i.e., if pi has already TO-Delivered the messages of batch L-1. If it is not the case, pi waits for the next expected batch (nextBatch) to respect total order.

11 It was shown in [2] that Ω is the weakest failure detector to solve consensus and total order broadcast in a crash-stop system model. Failure detector Ω can be implemented in a message passing system with partial synchrony assumptions [3].

12 Allowing two processes to propose for the same round could violate agreement. For example, process p1 invokes propose(1, v) and commits, and process p2 invokes propose(1, v′). The termination property of consensus allows p2 to commit: agreement would indeed be violated. However, if p1 invokes propose(1, v), crashes and recovers, p1 can then invoke propose(1, v) or even propose(1, v′) without violating the properties of round-based consensus.



Within a batch of messages, processes TO-Deliver messages using a deterministic ordering function. Note that an array of round-based registers is used in the total order broadcast protocol: each round-based register corresponds to the “store and lock” of a given consensus instance. Finally, note that a process pi instantiates a round-based register when (i) pi instantiates a round-based consensus, or (ii) pi receives for the first time a message for the Lth consensus, i.e., for the Lth register of the array.

Figure 8 depicts four typical execution schemes of the algorithm. We assume for all cases that (i) process p1 TO-Broadcasts a message m, (ii) process p5 is the eventual perpetual leader, and (iii) L = 1. (prop(∗) stands in the figures for propose(∗).) In Figure 8(a), p1 elects itself, triggers a new consensus instance by invoking propose(1, m), commits, and sends the decision to all. In Figure 8(b), p1 elects p5 and sends m to p5. Process p5 then invokes propose(5, m), commits, then sends the decision to all. In Figure 8(c), p1 first elects p3 and sends m to p3. In this case however, p3 does not elect itself and therefore does nothing. Later on, p1 elects p5 and then sends m to p5. As for case (b), p5 commits consensus and sends the decision to every process. Note that p3 could have sent m to p5 if p3 had elected p5. Finally, in Figure 8(d), p1 elects p3 (which does not elect itself), then p1 elects p2, which elects itself and invokes propose(2, m) but aborts. Finally, p1 elects p5, and, as for case (c), p5 commits consensus and sends the decision to all.

Figure 8. Execution schemes: (a) p1 is leader; (b) p5 is leader; (c) p1 first elects p3 and then p5; (d) p1 first elects p3, then p2 and finally p5.

Precise description. We give here more details about the algorithm of Figure 9. We first describe the main data structure, and then the main parts of the algorithm. Each process pi maintains a variable TO delivered that contains the messages that were TO-Delivered.

When pi receives a message m, pi adds m to the set Received, which keeps track of all messages that need to be TO-Delivered. Thus Received - TO delivered, denoted TO undelivered, contains the set of messages that were submitted for total order broadcast but are not yet TO-Delivered. The batches that have been decided but not yet TO-Delivered are put in the set AwaitingToBeDelivered. The variable nextBatch keeps track of the next expected batch in order to respect the total order property. There are four main parts in the protocol: (a) when a process receives some message, task launch starts13 task propose if the process pi is leader or, if pi is not leader, sends the messages it has not yet TO-Delivered to the leader; (b) task propose keeps on starting round-based consensus while pi is leader, until a decision is reached; (c) primitive receive handles received messages, and stops task propose once pi receives a decision; and (d) primitive deliver TO-Delivers messages. Each part is described below in more detail. Initially, when a process pi TO-Broadcasts a message m, pi puts m into the set Received, which has the effect of changing the predicate of guard line 15.

• In task launch, process pi triggers the upon case when the set TO undelivered contains new messages or when pi elects another leader (line 15). Note that the upon case is executed only once per received message to avoid multiple consensus instances for the exact same batch of messages. If the upon case is triggered by a leader change, pi jumps directly to line 26 and sends to the leader all the messages it has not yet TO-Delivered. Otherwise, before starting a new consensus instance, pi first verifies at line 16 whether (i) it already received the decision for this batch of messages, or (ii) it already TO-Delivered this batch of messages. Process pi then verifies if it is a leader, and if so, pi increments the batch number to initiate a consensus for a new batch of messages (L+1), i.e., pi starts task propose with TO undelivered as the batch of messages and the round number set to the id of pi. If pi is not leader, then pi sends the messages it has not yet TO-Delivered to the leader.

• In task propose, a process pi periodically invokes consensus (proposes) if pi is leader. By the property of weak leader election, one of the correct processes (pl) will be the eventual perpetual leader. Once pl is elected by every correct process, pl receives all batches of messages from every correct process, proposes and commits consensus (line 31), and then sends the decision to all (line 34). Note that in this primitive, pi proposes the same batch of messages but with an increasing round number.

• In the primitive receive, when process pi receives the decision of consensus (line 36), pi first stops task proposeL; pi does not stop the propose tasks of other batches, since this could influence the result of some other consensus instances (line 37). Process pi then verifies that the decision received is the next decision that was expected (nextBatch). Otherwise, there are two cases to consider: (i) pi is ahead, or (ii) pi is lagging. For case (i), if pi is ahead (i.e., receives a decision from a lower batch), pi sends to pj an UPDATE message for each batch that pj is missing (line 40). For case (ii), if pi receives a future batch, pi buffers the messages of the batch in the set AwaitingToBeDelivered, and pi also sends to pj an UPDATE message with nextBatch-1, in order for pj to update pi when pj receives this “on purpose lagging” message. Process pi waits until it gets the next expected batch in order to satisfy the total order property.

13 When we say that a new task is started, we mean a new instance of the task with its own variables (since there can be more than one batch of messages being treated at the same time). Moreover, the variable TO delivered means the union of all arrays TO delivered[L].



• In the primitive deliver, process pi TO-Delivers only the messages that were not already TO-Delivered (line 9 or 12), following the same deterministic order. We assume that pi removes all messages that appear twice in the same batch of messages.

We assume here a system model where messages keep being broadcast indefinitely. This assumption is precisely what enables us to ensure the uniformity of agreement without additional forced logs and communication steps.

Lemma 13. If the eventual perpetual leader proposes a batch of messages, it eventually decides. Proof. Assume by contradiction that process pi is the eventual perpetual leader that proposes a batch of messages and never decides. By the algorithm of Figure 9, pi keeps incrementing the round number k (line 33). Let k0 be the smallest round number reached by pi such that no process other than pi ever invokes any operation. By the algorithm of Figure 9, such a round number exists because, unless it is leader, no other process invokes any operation on the consensus. By the termination property of consensus and since the implementation of consensus is wait-free, pi commits propose(k0, ∗),

which means that pi decides a value: a contradiction. ✷
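To illustrate the round-number discipline described above, where a leader retries the same batch with rounds i, i+n, i+2n, ... until some consensus instance for that batch commits, here is a hedged Java sketch; it reuses the earlier consensus and leader election sketches, and the decision broadcast is an assumed primitive, not the paper's code.

import java.util.List;
import java.util.function.Consumer;

// Sketch (our own code) of task propose(L, l, msgSet): keep proposing the same batch while this
// process is leader, increasing the round by n after every aborted attempt so that rounds used
// by different processes never collide.
public final class ProposeTask<M> implements Runnable {

    private final int myId;                                   // process id; also the first round used
    private final int n;                                      // total number of processes
    private final WeakLeaderElection leaderElection;
    private final RoundBasedConsensus<List<M>> consensus;     // consensus instance for this batch L
    private final Consumer<List<M>> sendDecisionToAll;        // assumed DECISION broadcast primitive
    private final List<M> batch;

    private volatile boolean stopped = false;                 // set once a decision for this batch arrives

    public ProposeTask(int myId, int n, WeakLeaderElection leaderElection,
                       RoundBasedConsensus<List<M>> consensus,
                       Consumer<List<M>> sendDecisionToAll, List<M> batch) {
        this.myId = myId;
        this.n = n;
        this.leaderElection = leaderElection;
        this.consensus = consensus;
        this.sendDecisionToAll = sendDecisionToAll;
        this.batch = batch;
    }

    public void stop() {
        stopped = true;
    }

    @Override
    public void run() {
        int round = myId;                                     // rounds of process i: i, i+n, i+2n, ...
        while (!stopped) {
            if (leaderElection.leader() == myId) {
                RoundBasedConsensus.Outcome<List<M>> o = consensus.propose(round, batch);
                if (o.committed) {
                    sendDecisionToAll.accept(o.value);        // send DECISION to every process
                    return;
                }
                round += n;                                   // aborted: retry later with the next owned round
            } else {
                Thread.yield();                               // not the leader: do not propose
            }
        }
    }
}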

Lemma 14. Termination: If a process p i TO-Broadcasts a message m and then p i does not crash, then p i eventually TO-Delivers m. Proof. Suppose by contradiction that a process p i TO-Broadcasts a message m but never TO-Delivers m. Remember that every time p i elects a new process, pi sends m to this new leader. By the weak leader property, eventually p i elects the eventual perpetual leader process p l and pi sends m to pl . By lemma 13, p l proposes, decides and sends the decision to all processes. There are now two cases to consider: (a) p l does not crash, or (b) p l crashes. For case (a), by the properties of the channels, p i receives the decision from p l and TO-Delivers m: a contradiction. For case (b), if p l crashes, pl was not an eventual perpetual leader: a contradiction.

Lemma 15. Agreement: If a process TO-Delivers a message m, then every correct process eventually TO-Delivers m. Proof. Suppose by contradiction that a process pi TO-Delivers m and let pj be any correct process that does not TO-Deliver m. Process pi must have received the decision from some process pk (pk could be pi). There are two cases to consider: (a) pk is a correct process, or (b) pk is a faulty process. For case (a), since pk TO-Delivered m, by the reliability properties of the channels, every correct process receives the decision and TO-Delivers m: a contradiction. For case (b), since we assume that new messages keep coming, the eventual perpetual leader pl TO-Delivers m and therefore sends at some time the decision to every correct process: a contradiction. As explained earlier, due to round number uniqueness, no two processes can propose for the same round, therefore every correct process decides the same value for consensus. ✷

Lemma 16. Validity: For any message m, (i) every process p i that TO-Delivers m, TO-Delivers m only if m was previously TO-Broadcast by some process, and (ii) every process p i TO-Delivers m at most once. Proof. For the first part (i), suppose by contradiction that some process p i TO-Delivers a message m that was not TO-Broadcast by any process. For a message m to be TO-Delivered, by the algorithm of Figure 9, m must be decided



through round-based consensus. By the validity property of consensus, m has to be proposed (line 24). In order to be proposed, m has to be in the set TO undelivered (line 20); then to be in the set TO undelivered, m has to be in the set Received (line 46). Finally, for m to be in set Received, m has to be TO-Broadcast or sent (lines 6 & 26). Ultimately, for m to be sent, m must be TO-Broadcast: a contradiction. For the second part (ii), p i cannot TO-Deliver more than once a message m. This is impossible since line 8 removes all the messages that have been already TO-Delivered. Of course, we assume that p i distinguishes all messages that appear twice in the variable msgSet.

Lemma 17. Total order: Let pi and pj be any two processes that TO-Deliver some message m. If pi TO-Delivers some message m′ before m, then pj also TO-Delivers m′ before m. Proof. Suppose by contradiction that pi TO-Delivers a message m before a message m′ and pj TO-Delivers m′ before m. There are two cases to consider: (a) m and m′ are in the same message set, and (b) m and m′ are in different message sets. For case (a), since every process delivers messages following the same deterministic order, m is delivered before m′ on both processes: a contradiction. For case (b), suppose that m is part of msgSetL and m′ ∈ msgSetL′ where L < L′. For m′ to be TO-Delivered, msgSetL′ has to be received as a DECIDE or UPDATE message (line 36). If pi TO-Delivers m before m′, then pj cannot TO-Deliver m′ before m, since the predicate of guard line 38 forbids pj to TO-Deliver batches of messages out of order: a contradiction. Nevertheless, pj could receive the L′th batch of messages before the Lth batch of messages, but the batch would be put in the set AwaitingToBeDelivered.

Proposition 18. The algorithm of Figure 9 satisfies the termination, agreement, validity and total order properties.

Proof. Directly from lemmata 14, 15, 16 and 17. ✷

5 A Faithful Deconstruction of Paxos

This section describes a faithful and modular deconstruction of Paxos [11]. It is modular in the sense that it builds upon our abstractions: the specifications of these are not changed, only their implementations are slightly modified. It is faithful in the sense that it captures the practical spirit of the original Paxos protocol: it preserves the efficiency of Paxos and tolerates temporary crashes of links and processes. Just like the original Paxos protocol, we preclude the possibility of unstable processes: either processes are correct (eventually always-up), or they eventually crash and never recover. We will come back to this assumption in the next section. To step from a crash-stop model to a crash-recovery model, we mainly adapt the round-based register and slightly modify the global protocol to deal with recovery (shaded in Figure 10(a)); therefore we only present these abstractions in this section. Every process performs some forced logs so that it can consistently retrieve its state when it recovers. To cope with temporary link failures, we build upon a retransmission module, associated with two primitives s-send and s-receive: if a process pi s-sends a message to a correct process pj and pi does not crash, the message is eventually s-received.



1: For each process pi : 2: procedure initialisation: 3: Received[] ← ∅; TO delivered[] ← ∅; start task{launch} 4: TO undelivered ← ∅; AwaitingToBeDelivered[] ← ∅; K ← 1; nextBatch ← 1 5: procedure TO-Broadcast(m) 6: Received ← Received ∪ m 7: procedure deliver(msgSet) 8: TO delivered[nextBatch] ← msgSet - TO delivered 9: atomically deliver all messages in TO delivered[nextBatch] in some deterministic order {TO-Deliver} 10: nextBatch ← nextBatch +1 11: while AwaitingToBeDelivered[nextBatch] = ∅ do 12: TO delivered[nextBatch] ← AwaitingToBeDelivered[nextBatch]- TO delivered; atomically deliver TO delivered[nextBatch] 13: nextBatch ← nextBatch+1 14: task launch {Upon case executed only once per received message} 15: upon Received - TO delivered = ∅ or leader has changed do {If upon triggered by a leader change, jump to line 26} 16: while AwaitingToBeDelivered[K+1] = ∅ or TO delivered[K+1] = ∅ do 17: K ← K+1 18: if K = nextBatch and AwaitingToBeDelivered[K] = ∅ and TO delivered[K] = ∅ then 19: deliver(AwaitingToBeDelivered[K]) 20: TO undelivered ← Received − TO delivered 21: if leader()= pi then 22: while proposeK is active do 23: K ← K+1 24: start task proposeK (K, i, T O undelivered); K ← K+1 25: else 26: send(T O undelivered) to leader() 27: task propose(L, l, msgSet) {Keep on proposing until consensus commits} 28: committed ← false; consensusL ← new consensus() 29: while not committed do 30: if leader()= pi then 31: if consensusL .propose(l, msgSet) = (commit, returnedMsgSet) then 32: committed ← true 33: l ← l+n 34: send(DECISION,L, returnedMsgSet) to all processes 35: upon receive m from pj do K 36: if m = (DECISION,nextBatch,msgSet pj ) or m = (UPDATE,Kpj ,TO delivered[Kpj ]) then 37: if task proposeKpj is active then stop task proposeKpj 38: if Kpj = nextBatch then {pj is ahead or behind} if Kpj < nextBatch then {pj is behind} 39: for all L such that Kpj < L < nextBatch: send(UPDATE,L,TO delivered[L]) to p j {If pj = pi } 40: else 41: K 42: AwaitingToBeDelivered[Kpj ] = msgSet pj ; send(UPDATE,nextBatch-1,TO delivered[nextBatch-1]) to pj {If pj = pi } 43: else Kp 44: deliver(msgSet j ) 45: else 46: Received ← Received ∪ msgSetT O undelivered {Consensus messages are added to the consensus box}

Figure 9. A modular crash-stop variant of Paxos

Figure 10. The impact of a crash-recovery model: (a) Modules; (b) Added forced logs.



5.1 Retransmission Module

We describe here a retransmission module that encapsulates retransmission issues, to deal with temporary crashes of communication links. The primitives of the retransmission module (s-send and s-receive) preserve the no creation and fair loss properties of the underlying channels, and ensure the following property: Let pi be any process that s-sends a message m to a process pj, and then pi does not crash. If pj is correct, then pj eventually s-receives m. Figure 11 gives the algorithm of the retransmission module. All messages that need to be retransmitted are put in the variable xmitmsg. Messages in xmitmsg are erased, but the Paxos layer stops retransmitting messages, except for the DECISION or UPDATE messages, once a decision has been reached. The no creation and fair loss properties are trivially satisfied.

1: for each process pi:
2: procedure initialisation:
3:   xmitmsg[] ← ∅; start task{retransmit}
4: procedure s-send(m)                                               {To s-send m to pj}
5:   if m ∉ xmitmsg then                                             {Ensure that m is not added to xmitmsg more than once}
6:     xmitmsg ← xmitmsg ∪ m
7:   if pj ≠ pi then
8:     send m to pj
9:   else
10:    simulate s-receive m from pi
11: upon receive(m) from pj do
12:   s-receive(m)
13: task retransmit
14:   while true do
15:     for all m ∈ xmitmsg do                                       {Retransmit all messages received and sent}
16:       s-send(m)

Figure 11. Retransmission module

Proposition 19. Let pi be any process that s-sends a message m to a process pj, and then pi does not crash. If pj is correct, then pj eventually s-receives m. Proof. Suppose that pi s-sends a message m to a process pj and then does not crash. Assume by contradiction that pj is correct, yet pj does not s-receive m. There are two cases to consider: (a) pj does not crash, or (b) pj crashes and eventually recovers and remains always-up. For case (a), by the fair loss property of the links, pj receives and then s-receives m: a contradiction. For case (b), since process pi keeps on sending m to pj, there is a time after which pi sends m to pj and none of them crash afterwards. As for case (a), by the fair loss property of the links, pj eventually

receives m, then s-receives m: a contradiction. ✷
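A hedged Java sketch of the s-send pattern of Figure 11: every s-sent message is remembered and periodically resent by a retransmit task. The underlying transport call, the per-destination bookkeeping and the retransmission period are our own simplifications.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Sketch (our own code) of the retransmission module: s-send stores the message and a background
// task keeps resending everything in xmitmsg over the underlying fair-lossy channel.
public final class Retransmitter<M> {

    private final Map<Integer, M> xmitmsg = new ConcurrentHashMap<>();  // destination -> pending message
    private final BiConsumer<Integer, M> rawSend;                       // assumed fair-lossy send(dest, m)
    private final long periodMillis;

    public Retransmitter(BiConsumer<Integer, M> rawSend, long periodMillis) {
        this.rawSend = rawSend;
        this.periodMillis = periodMillis;
        Thread retransmit = new Thread(this::retransmitLoop, "retransmit");
        retransmit.setDaemon(true);
        retransmit.start();
    }

    // s-send(m) to a destination: remember m (simplification: one pending message per destination), then send once
    public void sSend(int destination, M message) {
        xmitmsg.put(destination, message);
        rawSend.accept(destination, message);
    }

    // Called by the upper layer once retransmission is no longer needed (e.g., after a DECISION)
    public void stop(int destination) {
        xmitmsg.remove(destination);
    }

    private void retransmitLoop() {
        while (true) {
            xmitmsg.forEach(rawSend::accept);   // resend every pending message
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}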

5.2 Round-Based Register

We give in Figure 12 the implementation of a round-based register in a crash-recovery model. The main differences with our crash-stop implementation given in the previous section are the following. As shown in Figure 10(b), a process logs the variables readi, writei and vi, in order to be able to consistently recover its precedent state after a crash. A recovery procedure re-initialises the process and retrieves all variables. The send (resp. receive) primitive is also replaced by the s-send (resp. s-receive) primitive.



1: procedure register()                                              {Constructor, for each process pi}
2:   readi ← 0
3:   writei ← 0
4:   vi ← ⊥
5: procedure read(k)
6:   s-send [READ, k] to all processes
7:   wait until s-received [ackREAD, k, ∗, ∗] or [nackREAD, k] from ⌈(n+1)/2⌉ processes
8:   if s-received at least one [nackREAD, k] then
9:     return(abort, v)
10:  else
11:    select the [ackREAD, k, k′, v] with the highest k′
12:    return(commit, v)
13: procedure write(k, v)
14:   s-send [WRITE, k, v] to all processes
15:   wait until s-received [ackWRITE, k] or [nackWRITE, k] from ⌈(n+1)/2⌉ processes
16:   if s-received at least one [nackWRITE, k] then
17:     return(abort)
18:   else
19:     return(commit)
20: task wait until s-receive [READ, k] from pj
21:   if writei ≥ k or readi ≥ k then
22:     s-send [nackREAD, k] to pj
23:   else
24:     readi ← k; store{readi}                                      {Modified from Figure 5}
25:     s-send [ackREAD, k, writei, vi] to pj
26: task wait until s-receive [WRITE, k, v] from pj
27:   if writei > k or readi > k then
28:     s-send [nackWRITE, k] to pj
29:   else
30:     writei ← k
31:     vi ← v; store{writei, vi}                                    {Modified from Figure 5}
32:     s-send [ackWRITE, k] to pj
33: upon recovery do                                                 {Added procedure to Figure 5}
34:   initialisation
35:   retrieve{writei, readi, vi}

Figure 12. A wait-free round-based register in a crash-recovery model

Proposition 20. With a majority of correct processes, the algorithm of Figure 12 implements a wait-free round-based register.

Lemma 21. Read-abort: If read(k) aborts, then some operation read(k′) or write(k′, ∗) was invoked with k′ ≥ k.

Lemma 22. Write-abort: If write(k, ∗) aborts, then some operation read(k′) or write(k′, ∗) was invoked with k′ > k.

Lemma 23. Read-write-commit: If read(k) or write(k, ∗) commits, then no subsequent read(k′) can commit with k′ ≤ k and no subsequent write(k′, ∗) can commit with k′ < k.

Lemma 24. Read-commit: If read(k) commits with v and v ≠ ⊥, then some operation write(k′, v) was invoked with k′ < k.

Lemma 25. Write-commit: If write(k, v) commits and no subsequent write(k′, v′) is invoked with k′ ≥ k and v′ ≠ v, then any read(k″) that commits, commits with v if k″ > k.

The proofs for lemmata 21 through 25 are similar to those of lemmata 1 through 5 since: (a) if pi invokes a read() or a write() operation and then does not crash, by the property of the retransmission module, pi keeps on sending messages (e.g., READ messages for the read() operation) until it gets a majority of replies (e.g., ackREAD or nackREAD); (b) since all variables are logged before sending any positive acknowledgement message, a process does not behave differently if it crashes and recovers. If a process crashes and recovers, it recovers its precedent state and therefore acts as if it did not crash.

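The key change in the crash-recovery register is that readi, writei and vi are forced to stable storage before any positive acknowledgement is sent. Below is a hedged Java sketch of the corresponding message handlers, reusing the StableStorage sketch of Section 2.1; it is an illustration of the forced-log discipline, not the paper's implementation.

import java.io.IOException;

// Sketch (our own code) of the READ/WRITE handlers of Figure 12: the accepted round numbers and the
// value are logged before acking, so that after a crash and recovery the process behaves as if it
// had never crashed. StableStorage is the sketch from Section 2.1.
public final class RecoverableRegisterCell {

    private int readRound = 0;       // read_i
    private int writeRound = 0;      // write_i
    private String value = null;     // v_i (null stands for ⊥)
    private final StableStorage stable;

    public RecoverableRegisterCell(StableStorage stable) throws IOException {
        this.stable = stable;
        // recovery procedure: retrieve the logged state, if any
        String r = stable.retrieve("readRound");
        String w = stable.retrieve("writeRound");
        if (r != null) readRound = Integer.parseInt(r);
        if (w != null) writeRound = Integer.parseInt(w);
        value = stable.retrieve("value");
    }

    // Handle [READ, k]: log read_i before acking; returns true for ackREAD, false for nackREAD
    public synchronized boolean onRead(int k) throws IOException {
        if (writeRound >= k || readRound >= k) return false;        // nackREAD(k)
        readRound = k;
        stable.store("readRound", Integer.toString(readRound));     // forced log before the ack
        return true;                                                // ackREAD(k, write_i, v_i)
    }

    // Handle [WRITE, k, v]: log (write_i, v_i) before acking; returns true for ackWRITE, false for nackWRITE
    public synchronized boolean onWrite(int k, String v) throws IOException {
        if (writeRound > k || readRound > k) return false;          // nackWRITE(k)
        writeRound = k;
        value = v;
        stable.store("writeRound", Integer.toString(writeRound));   // forced log before the ack
        stable.store("value", value);
        return true;                                                // ackWRITE(k)
    }
}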

5.3 Weak Leader Election

The implementation of the weak leader election does not change in a crash-recovery model. However, the failure detector Ω has only been defined in a crash-stop model [2]. Interestingly, its definition (there is a time after which exactly one correct process pl is always trusted by every correct process) does not change in a crash-recovery model (the notion of correctness changes though). We give in Appendix B an implementation of the failure detector Ω in a crash-recovery model with partial synchrony assumptions.

5.4 Modular Paxos

Figure 10(b) shows that, compared to a crash-stop version, the total order broadcast protocol adds (i) a recovery procedure, and (ii) one forced log to store the set TO_delivered and the variable nextBatch. We now say that a process TO-Delivers a message m when the process logs m. In a stable period, a process can TO-Deliver a message after three forced logs and two round-trip communication steps (if the leader is the process that broadcasts the message). Section 6.4 introduces a powerful optimisation that requires only one forced log at a majority of processes and one round-trip communication step (if the requesting process is the leader).

Proposition 26. With a wait-free round-based consensus and a wait-free weak leader election, the algorithm of Figure 13 ensures the termination, agreement, validity and total order properties in a crash-recovery model without unstable processes.

Lemma 27. Termination: If a process p_i TO-Broadcasts a message m and then p_i does not crash, then p_i eventually TO-Delivers m.

Lemma 28. Agreement: If a process TO-Delivers a message m, then every correct process eventually TO-Delivers m.

Lemma 29. Validity: For any message m, (i) every process p_i that TO-Delivers m, TO-Delivers m only if m was previously TO-Broadcast by some process, and (ii) every process p_i TO-Delivers m at most once.

Lemma 30. Total order: Let p_i and p_j be any two processes that TO-Deliver some message m. If p_i TO-Delivers some message m' before m, then p_j also TO-Delivers m' before m.

The proofs for lemmata 27 through 30 are identical to those of lemmata 14 through 17 since: (a) if p_i TO-Broadcasts m and then does not crash, by the property of the retransmission module, p_i keeps on sending m to the leader, and therefore the predicate at line 17 of Figure 13 becomes true at the eventual perpetual leader; (b) by the weak leader election property, one of the correct processes will be an eventual perpetual leader p_l that decides; by its definition, p_l is eventually always-up, and then eventually keeps on sending the decision to all processes, so all correct processes s-receive the decision (even those that crash and recover); (c) the implementation is built on a wait-free round-based register and on a wait-free round-based consensus that are tolerant to crash-recovery (without unstable processes); (d) when a process crashes and recovers, it retrieves its previous state by retrieving TO_delivered and nextBatch; (e) when recovering, Received is set to TO_delivered, otherwise the predicate of line 17 would never become false and the process would keep on proposing messages; and (f) since processes keep on broadcasting messages, the leader process eventually updates a process that has crashed and recovered with all lagging messages.



1: For each process p_i:
2: procedure initialisation:
3:    Received[] ← ∅; TO_delivered[] ← ∅; start task{launch}
4:    TO_undelivered[] ← ∅; AwaitingToBeDelivered[] ← ∅; K ← 1; nextBatch ← 1
5: procedure TO-Broadcast(m)
6:    Received ← Received ∪ m
7: procedure deliver(msgSet)
8:    TO_delivered[nextBatch] ← msgSet − TO_delivered
9:    atomically deliver all messages in TO_delivered[nextBatch] in some deterministic order
10:   store{TO_delivered, nextBatch}    {TO-Deliver, added to Figure 9}
11:   nextBatch ← nextBatch+1    {Stop retransmission module ∀ messages of nextBatch-1 except DECIDE or UPDATE}
12:   while AwaitingToBeDelivered[nextBatch] ≠ ∅ do
13:      TO_delivered[nextBatch] ← AwaitingToBeDelivered[nextBatch] − TO_delivered; atomically deliver TO_delivered[nextBatch]
14:      store{TO_delivered, nextBatch}
15:      nextBatch ← nextBatch+1    {Stop retransmission module ∀ messages of nextBatch except DECIDE or UPDATE}
16: task launch    {Upon case executed only once per received message}
17:   upon Received − TO_delivered ≠ ∅ or leader has changed do    {If upon triggered by a leader change, jump to line 28}
18:      while AwaitingToBeDelivered[K+1] ≠ ∅ or TO_delivered[K+1] ≠ ∅ do
19:         K ← K+1
20:      if K = nextBatch and AwaitingToBeDelivered[K] ≠ ∅ and TO_delivered[K] = ∅ then
21:         deliver(AwaitingToBeDelivered[K])
22:      TO_undelivered ← Received − TO_delivered
23:      if leader() = p_i then
24:         while propose_K is active do
25:            K ← K+1
26:         start task propose_K(K, i, TO_undelivered); K ← K+1
27:      else
28:         s-send(TO_undelivered) to leader()
29: task propose(L, l, msgSet)    {Keep on proposing until consensus commits}
30:   committed ← false; consensus_L ← new consensus()
31:   while not committed do
32:      if leader() = p_i then
33:         if consensus_L.propose(l, msgSet) = (commit, returnedMsgSet) then
34:            committed ← true
35:      l ← l+n
36:   s-send(DECISION, L, returnedMsgSet) to all processes
37: upon s-receive m from p_j do
38:   if m = (DECISION, nextBatch, msgSet^{Kpj}) or m = (UPDATE, Kpj, TO_delivered[Kpj]) then
39:      if task propose_{Kpj} is active then stop task propose_{Kpj}
40:      if Kpj ≠ nextBatch then    {p_j is ahead or behind}
41:         if Kpj < nextBatch then    {p_j is behind}
42:            for all L such that Kpj < L < nextBatch: s-send(UPDATE, L, TO_delivered[L]) to p_j    {If p_j ≠ p_i}
43:         else
44:            AwaitingToBeDelivered[Kpj] ← msgSet^{Kpj}; s-send(UPDATE, nextBatch-1, TO_delivered[nextBatch-1]) to p_j    {If p_j ≠ p_i}
45:      else
46:         deliver(msgSet^{Kpj})
47:   else
48:      Received ← Received ∪ msgSet^{TO_undelivered}    {Consensus messages are treated in the consensus box}
49: upon recovery do    {Added procedure to Figure 9}
50:   initialisation
51:   retrieve{TO_delivered, nextBatch}; K ← nextBatch; nextBatch ← nextBatch+1; Received ← TO_delivered

Figure 13. A modularisation of Paxos



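To illustrate the delivery step of Figure 13, the following sketch (our own types and names, not the paper's interface) shows the essence of deliver(): deduplicate against what was already TO-Delivered, deliver the batch in a deterministic order, log it, and then drain any decisions that arrived ahead of time. Messaging and the actual forced log are abstracted away.

import java.util.*;

// Illustrative sketch of batch delivery and out-of-order decisions.
final class BatchDeliverer {
    private final Map<Integer, Set<String>> toDelivered = new HashMap<>();   // TO_delivered[]
    private final Map<Integer, Set<String>> awaiting = new HashMap<>();      // AwaitingToBeDelivered[]
    private int nextBatch = 1;

    // A decision for some batch arrives: deliver it now if it is the expected
    // batch, otherwise stash it until the preceding batches have been delivered.
    void onDecision(int batch, Set<String> msgSet) {
        if (batch == nextBatch) deliver(msgSet);
        else awaiting.put(batch, msgSet);
    }

    void deliver(Set<String> msgSet) {
        deliverOne(msgSet);
        while (awaiting.containsKey(nextBatch))        // drain decisions received ahead of time
            deliverOne(awaiting.remove(nextBatch));
    }

    private void deliverOne(Set<String> msgSet) {
        Set<String> already = new HashSet<>();
        toDelivered.values().forEach(already::addAll);
        List<String> batch = new ArrayList<>(msgSet);
        batch.removeAll(already);                      // msgSet - TO_delivered
        Collections.sort(batch);                       // "some deterministic order"
        toDelivered.put(nextBatch, new HashSet<>(batch));
        batch.forEach(m -> System.out.println("TO-Deliver " + m));
        forceLog();                                    // stand-in for store{TO_delivered, nextBatch}
        nextBatch++;
    }

    private void forceLog() { /* forced log to stable storage would go here */ }
}

The deterministic ordering inside a batch is what lets all processes TO-Deliver the same sequence once they agree on the batch contents.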

6 The Four Seasons

This section presents four interesting variants of the Paxos protocol. Subsection 6.1 describes a variant of the protocol that alleviates the need for stable storage under the assumption that some processes never crash. This is obtained mainly by modifying the implementation of our round-based register. Subsection 6.2 describes a variant of the protocol that copes with unstable processes, through a modification of our weak leader election implementation. Subsection 6.3 describes a variant of the protocol that guarantees progress even if only one process is correct. This is obtained through an implementation of our round-based register that assumes a decoupling between disks and processes, along the lines of [5]. Subsection 6.4 describes an optimised variant (Fast Paxos) of the protocol that is very efficient in stable periods. These variants are orthogonal, except for 6.1 and 6.3 (because of their contradictory assumptions).

[Figure 14 (diagram): the module stacks of the four variants, with the modified modules shaded. (a) Winter: the Round-Based Register is modified (some processes never crash, no need for stable storage) and a new recovery procedure is added to Paxos. (b) Spring: the Weak Leader Election is modified (exchange of the failure-detector state between processes; a majority must trust a process). (c) Summer: the Round-Based Register is modified (decoupling of disks and processes; a majority of correct commodity disks instead of correct processes). (d) Fall: Fast Paxos uses a Fast Round-Based Consensus (fastpropose() operation) and a Fast Round-Based Register (fastwrite() operation), switching from the regular to the fast communication pattern. All variants retain the Retransmission module and the Communication layer.]

Figure 14. Modified (in shade) modules from a crash-recovery variant

6.1 Winter: Avoiding Stable Storage

Basically, we assume here that some of the processes never crash and, instead of stable storage, we store the crucial information of the register inside "enough" processes (in main memory). The protocol assumes that the number of processes that never crash (n_a) is strictly greater than the number of faulty processes (n_f); note that n_a is not known, while n_f is. As depicted by Figure 14(a), the weak leader election and the round-based consensus remain unchanged. We mainly change the round-based register implementation, and we add to the Paxos protocol a recovery procedure that relies on initialisation messages instead of stable storage. Basically, a recovered process p_i asks all other processes to return the set of messages that they have TO-Delivered, and p_i initialises its state using those messages.

Round-Based Register. The trick in the round-based register implementation is to ensure that the register's value is "locked" in at least one process that never crashes. Intuitively, any read() or write() uses a threshold that guarantees this property, as we explain below.




(The idea is inspired by [1].) When a process recovers, it stops participating in the protocol, except that it periodically broadcasts a RECOVERED message. When a process p_i receives such a message from a process p_j, p_i adds p_j to a set R_i of processes known to have recovered. This scheme allows any process to count the number of recovered processes. While collecting ackREAD or ackWRITE messages, if p_i detects that a new process p_k has recovered (R_i ≠ PrevR_i), p_i restarts the whole procedure of reading or writing. For p_i to commit a read() (resp. write()) invocation, p_i must receive max(n_f+1, n−n_f−|R_i|) ackREAD (resp. ackWRITE) messages.
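Before turning to the full algorithm in Figure 15, the sketch below illustrates just this acknowledgement rule. It is a hedged illustration with our own names (WinterQuorum, PhaseAttempt): an operation restarts whenever a new RECOVERED process is observed, and it needs max(n_f+1, n−n_f−|R_i|) acknowledgements to commit.

import java.util.*;

// Sketch (our own naming) of the Winter acknowledgement rule.
final class WinterQuorum {
    private final int n;        // total number of processes
    private final int nf;       // assumed upper bound on faulty processes (n_a > n_f)
    private final Set<Integer> recovered = new HashSet<>();   // R_i

    WinterQuorum(int n, int nf) { this.n = n; this.nf = nf; }

    void onRecoveredMessage(int pj) { recovered.add(pj); }    // R_i <- R_i ∪ {p_j}

    int acksNeeded() {
        return Math.max(nf + 1, n - nf - recovered.size());   // max(n_f+1, n - n_f - |R_i|)
    }

    // Skeleton of the repeat/until loop of Figure 15: retry the phase while the
    // set of recovered processes keeps changing underneath us.
    boolean runPhase(PhaseAttempt attempt) {
        Set<Integer> prev;
        do {
            prev = new HashSet<>(recovered);                  // PrevR_i <- R_i
            attempt.collectAcks(acksNeeded());                // s-send and wait for acks/nacks
        } while (!recovered.equals(prev));                    // until R_i = PrevR_i
        return attempt.sawNoNack();
    }

    interface PhaseAttempt {
        void collectAcks(int howMany);
        boolean sawNoNack();
    }
}

The threshold guarantees that every committed read or write is acknowledged by at least one always-up process, which is what replaces the forced log.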

1: {seqrd (resp. seqwr) distinguishes the phases in which p_i has restarted to s-send READ (resp. WRITE) messages because p_i received a RECOVERED message}
2: procedure register()    {Constructor, for each process p_i}
3:    read_i ← 0
4:    write_i ← 0
5:    v_i ← ⊥
6:    R_i ← ∅; PrevR_i ← ∅    {Added to Figure 5}
7:    seqrd_pi ← 0; seqwr_pi ← 0    {Variables used to distinguish retrials, added to Figure 5}
8: procedure read(k)
9:    repeat    {Added to Figure 5}
10:      PrevR_i ← R_i; seqrd_pi ← seqrd_pi + 1
11:      s-send [READ, k, seqrd_pi] to all processes
12:      wait until s-received [ackREAD, k, seqrd_pi, *, *] or [nackREAD, k, seqrd_pi] from max(n_f+1, n−n_f−|R_i|) processes
13:   until R_i = PrevR_i    {Added to Figure 5}
14:   if s-received at least one [nackREAD, k, seqrd_pi] then
15:      return(abort, v)
16:   else
17:      select the [ackREAD, k, seqrd_pi, k', v] with the highest k'
18:      return(commit, v)
19: procedure write(k, v)
20:   repeat    {Added to Figure 5}
21:      PrevR_i ← R_i; seqwr_pi ← seqwr_pi + 1
22:      s-send [WRITE, k, seqwr_pi, v] to all processes
23:      wait until s-received [ackWRITE, k, seqwr_pi] or [nackWRITE, k, seqwr_pi] from max(n_f+1, n−n_f−|R_i|) processes
24:   until R_i = PrevR_i    {Added to Figure 5}
25:   if s-received at least one [nackWRITE, k, seqwr_pi] then
26:      return(abort)
27:   else
28:      return(commit)
29: task wait until s-receive [READ, k, seqrd_pj] from p_j
30:   if write_i ≥ k or read_i ≥ k then
31:      s-send [nackREAD, k, seqrd_pj] to p_j
32:   else
33:      read_i ← k
34:      s-send [ackREAD, k, seqrd_pj, write_i, v_i] to p_j
35: task wait until s-receive [WRITE, k, seqwr_pj, v] from p_j
36:   if write_i > k or read_i > k then
37:      s-send [nackWRITE, k, seqwr_pj] to p_j
38:   else
39:      write_i ← k
40:      v_i ← v
41:      s-send [ackWRITE, k, seqwr_pj] to p_j
42: upon s-receive RECOVERED from p_j do    {Added procedures to Figure 5}
43:   R_i ← R_i ∪ {p_j}
44: upon recovery do
45:   initialisation; read_i ← ∞; write_i ← ∞    {Do not reply to READ or WRITE messages}
46:   s-send RECOVERED to all processes

Figure 15. A wait-free round-based register in a crash-recovery model without stable storage

Proposition 31. The algorithm of Figure 15 implements a wait-free round-based register in a crash-recovery model without stable storage, assuming that n_a > n_f.

Lemma 32. Read-abort: If read(k) aborts, then some operation read(k') or write(k', ∗) was invoked with k' ≥ k.

Lemma 33. Write-abort: If write(k, ∗) aborts, then some operation read(k') or write(k', ∗) was invoked with k' > k.

Lemma 34. Read-write-commit: If read(k) or write(k, ∗) commits, then no subsequent read(k') can commit with k' ≤ k and no subsequent write(k', ∗) can commit with k' < k.

Lemma 35. Read-commit: If read(k) commits with v and v ≠ ⊥, then some operation write(k', v) was invoked with k' < k.

Lemma 36. Write-commit: If write(k, v) commits and no subsequent write(k', v') is invoked with k' ≥ k and v' ≠ v, then any read(k') that commits, commits with v if k' > k.

The proofs for lemmata 32 through 36 are identical to those of lemmata 21 through 25. They are based on the following aspects: (a) we assume that n_a > n_f; (b) when a process crashes and recovers, it keeps on sending RECOVERED messages, which ensures that a recovered process is never considered correct; and (c) since a process waits for the maximum between n_f+1 and n−n_f−|R_i| acknowledgements, the register's value is always locked into at least one always-up process.

The Paxos Variant. Figure 16 presents a Paxos variant for a crash-recovery model without stable storage.

Proposition 37. With a wait-free round-based consensus and a wait-free weak leader election, the algorithm of Figure 16 ensures the termination, agreement, validity and total order properties in a crash-recovery model (without any stable storage), assuming that n_a > n_f.

Lemma 38. Termination: If a process p_i TO-Broadcasts a message m and then p_i does not crash, then p_i eventually TO-Delivers m.

Lemma 39. Agreement: If a process TO-Delivers a message m, then every correct process eventually TO-Delivers m.

Lemma 40. Validity: For any message m, (i) every process p_i that TO-Delivers m, TO-Delivers m only if m was previously TO-Broadcast by some process, and (ii) every process p_i TO-Delivers m at most once.

Lemma 41. Total order: Let p_i and p_j be any two processes that TO-Deliver some message m. If p_i TO-Delivers some message m' before m, then p_j also TO-Delivers m' before m.

The proofs for lemmata 38 through 41 are identical to those of lemmata 27 through 30, since the recovery procedure requests every participant to s-send back their state when they s-receive a RECOVERED message. A process that crashes and recovers receives the "latest state" from at least one always-up process.
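The sketch below combines, purely for illustration, the two recovery actions just described: the register-level behaviour (stop acknowledging and announce RECOVERED) and the Paxos-level state transfer (rebuild TO_delivered from the sets returned by the other processes). All names are ours, and the network is reduced to a small interface.

import java.util.*;

// Sketch (our own naming) of the Winter recovery handshake. At least one of
// the replying processes is assumed to be always-up, since n_a > n_f.
final class WinterRecovery {
    int readBarrier = 0, writeBarrier = 0;                 // read_i, write_i
    final Map<Integer, Set<String>> toDelivered = new HashMap<>();

    // upon recovery: re-initialise, refuse to acknowledge, announce RECOVERED.
    void onRecovery(Network net) {
        readBarrier = Integer.MAX_VALUE;                   // read_i <- infinity
        writeBarrier = Integer.MAX_VALUE;                  // write_i <- infinity: nack every request
        net.broadcast("RECOVERED");
    }

    // upon receiving the state of another process (sent in reply to RECOVERED):
    // merge its TO_delivered batches into ours.
    void onStateReceived(Map<Integer, Set<String>> remoteToDelivered) {
        remoteToDelivered.forEach((batch, msgs) ->
            toDelivered.computeIfAbsent(batch, b -> new HashSet<>()).addAll(msgs));
    }

    interface Network { void broadcast(String message); }
}

Setting the barriers to "infinity" is what makes a recovered process nack every READ and WRITE, so it is never counted as correct by the quorum rule above.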

6.2 Spring: Coping with Unstable Processes

We discuss here a Paxos variant that copes with unstable processes, i.e., processes that keep crashing and recovering forever. We adapt our modular protocol by simply changing the implementation of our weak leader election protocol, as depicted in Figure 14(b). All our other modules remain unchanged.

Intuitively, the issue with unstable processes is the following. Consider an unstable process p_i (i.e., p_i keeps on crashing and recovering), and suppose that its Ω_i module permanently outputs p_i, whereas the correct processes permanently consider some other correct process p_j as leader. This is possible since Ω "only" guarantees that some correct process is always trusted by every correct process. For instance, an unstable process is free to permanently elect itself. The presence of two concurrent leaders can prevent the commitment of any consensus decision and hence prevent progress.



1: For each process p_i:
2: procedure initialisation:
3:    Received[] ← ∅; TO_delivered[] ← ∅; start task{launch}
4:    TO_undelivered[] ← ∅; AwaitingToBeDelivered[] ← ∅; K ← 1; k ← 0; nextBatch ← 1
5: procedure TO-Broadcast(m)
6:    Received ← Received ∪ m
7: procedure deliver(msgSet)
8:    TO_delivered[nextBatch] ← msgSet − TO_delivered
9:    atomically deliver all messages in TO_delivered[nextBatch] in some deterministic order    {TO-Deliver}
10:   nextBatch ← nextBatch+1    {Stop retransmission module ∀ messages of nextBatch-1 except DECIDE or UPDATE}
11:   while AwaitingToBeDelivered[nextBatch] ≠ ∅ do
12:      TO_delivered[nextBatch] ← AwaitingToBeDelivered[nextBatch] − TO_delivered; atomically deliver TO_delivered[nextBatch]
13:      nextBatch ← nextBatch+1    {Stop retransmission module ∀ messages of nextBatch-1 except DECIDE or UPDATE}
14: task launch    {Upon case executed only once per received message}
15:   upon Received − TO_delivered ≠ ∅ or leader has changed do    {If upon triggered by a leader change, jump to line 26}
16:      while AwaitingToBeDelivered[K+1] ≠ ∅ or TO_delivered[K+1] ≠ ∅ do
17:         K ← K+1
18:      if K = nextBatch and AwaitingToBeDelivered[K] ≠ ∅ and TO_delivered[K] = ∅ then
19:         deliver(AwaitingToBeDelivered[K])
20:      TO_undelivered ← Received − TO_delivered
21:      if leader() = p_i then
22:         while propose_K is active do
23:            K ← K+1
24:         start task propose_K(K, i, TO_undelivered); K ← K+1
25:      else
26:         s-send(TO_undelivered) to leader()
27: task propose(L, l, msgSet)    {Keep on proposing until consensus commits}
28:   committed ← false; consensus_L ← new consensus()
29:   while not committed do
30:      if leader() = p_i then
31:         if consensus_L.propose(l, msgSet) = (commit, returnedMsgSet) then
32:            committed ← true
33:      l ← l+n
34:   s-send(DECISION, L, returnedMsgSet) to all processes
35: upon s-receive m from p_j do
36:   if m = (DECISION, nextBatch, msgSet^{Kpj}) or m = (UPDATE, Kpj, TO_delivered[Kpj]) then
37:      if task propose_K is active then stop task propose_K
38:      if Kpj ≠ nextBatch then    {p_j is ahead or behind}
39:         if Kpj < nextBatch then    {p_j is behind}
40:            for all L such that Kpj < L < nextBatch: s-send(UPDATE, L, TO_delivered[L]) to p_j    {If p_j ≠ p_i}
41:         else
42:            AwaitingToBeDelivered[Kpj] ← msgSet^{Kpj}; s-send(UPDATE, nextBatch-1, TO_delivered[nextBatch-1]) to p_j    {If p_j ≠ p_i}
43:      else
44:         deliver(msgSet^{Kpj})
45:   else
46:      Received ← Received ∪ msgSet^{TO_undelivered}    {Consensus messages are treated in the consensus box}
47: upon recovery do    {Added procedure to Figure 9}
48:   initialisation; s-send(UPDATE, 0, ∅) to all processes

Figure 16. A variant of Paxos in a crash-recovery model without stable storage



We basically need to prevent unstable processes from being leaders after some time. We modify our weak leader election protocol as follows: (a) every process p_k exchanges the output value of its Ω_k module with all other processes, and (b) the function leader() returns p_l only when a majority of processes think that p_l is leader. The latter step is required to avoid the following case. Imagine an unstable process p_u that invokes leader(), which returns p_u, then crashes, recovers, and keeps on repeating the same scheme forever. Process p_u always trusts itself, which violates the Ω property. By waiting for a majority of processes, we ensure that the value Ω_i of at least one correct process belongs to the set Ω[]. Therefore, p_u cannot trust itself (or any unstable process) forever, since its epoch number is eventually greater than that of any correct process. This idea, inspired by [7], assumes a majority of correct processes. Note that this assumption is now needed both in the implementation of the register and in the implementation of the leader election protocol.

We give the implementation of this new weak leader election in Figure 17, and it is easy to verify that the implementation is wait-free under the assumption that a majority of processes are correct. The weak leader election now exchanges the output of Ω between every pair of processes. However, this exchange phase can be piggy-backed on the I-AM-ALIVE messages in the implementation of Ω (see Appendix B). Thus, the exchange phase does not add any communication steps.

1: initialisation:    {Modified from Figure 7, for each process p_i}
2:    Ω[] ← ⊥; start task exchange
3: procedure leader()
4:    wait until p_l ∈ Ω[k] for at least ⌈(n+1)/2⌉ indices k
5:    return(p_l)
6: task exchange    {Added task to Figure 7}
7:    periodically send Ω_pi to all processes
8: upon receive Ω_pj from p_j do Ω[j] ← Ω_pj

Figure 17. A wait-free weak leader election with Ω and unstable processes

Proposition 42. The algorithm of Figure 17 ensures that some process is an eventual perpetual leader.

Proof. Suppose, by contradiction, that there is more than one eventual perpetual leader or that there is no eventual perpetual leader. Consider the first case and suppose that there are forever two eventual perpetual leaders: this contradicts the definition of an eventual perpetual leader. Now consider the second case, where there is no eventual perpetual leader. By the property of the Ω failure detector, eventually all correct processes trust only one correct process p_l. By line 3 of Figure 17, it is impossible for any process to elect forever a process other than p_l. The leader() function is non-blocking since there is a majority of correct processes. So eventually the invocation of leader() at every process returns in a bounded time (or the process crashes) and always returns p_l, so there is one eventual perpetual leader p_l: a contradiction. ✷

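As a concrete illustration of the majority rule above, the sketch below (our own naming, not the paper's interface) counts the exchanged Ω outputs and only returns a leader once some process is trusted by at least ⌈(n+1)/2⌉ processes; the exchange of Ω values is assumed to be piggy-backed on the I-AM-ALIVE messages.

import java.util.*;
import java.util.concurrent.*;

// Sketch of the Spring leader() rule.
final class MajorityLeaderElection {
    private final int n;                                               // number of processes
    private final Map<Integer, Integer> omega = new ConcurrentHashMap<>(); // Ω[j]: who p_j trusts

    MajorityLeaderElection(int n) { this.n = n; }

    // upon receive Ω_pj from p_j do Ω[j] <- Ω_pj
    void onOmegaReceived(int fromProcess, int trustedProcess) {
        omega.put(fromProcess, trustedProcess);
    }

    // leader(): wait until some p_l appears in at least ceil((n+1)/2) entries of Ω[].
    int leader() throws InterruptedException {
        while (true) {
            Map<Integer, Long> votes = new HashMap<>();
            for (int trusted : omega.values())
                votes.merge(trusted, 1L, Long::sum);
            for (Map.Entry<Integer, Long> e : votes.entrySet())
                if (e.getValue() >= (n + 2) / 2)                       // integer form of ceil((n+1)/2)
                    return e.getKey();
            Thread.sleep(10);                                          // poll until a majority agrees
        }
    }
}

Because at least one correct process contributes to every majority, an unstable process that only ever trusts itself can never be returned forever.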

6.3 Summer: Decoupling Disks and Processes

The Paxos protocol ensures progress only if there is a time after which a majority of the processes are correct. The need for this majority is due to the fact that a process cannot decide on a given order for any two messages unless this information is "stored and locked" at a majority of the processes. If disks and processes can be decoupled, which is considered a very reasonable assumption in some practical systems [5], a process might be able to decide on some order as long as it can "store and lock" that information within a majority of the disks. We simply modify the implementation of our round-based register (Figure 14(c)) to obtain a variant of Paxos that exploits that underlying configuration.

In this Paxos variant, we assume that disks can be directly (and remotely) accessed by processes, and that failures of disks and processes are independent. Every process has an assigned block on each disk and maintains a record dblock[p_i] that contains three elements: read_i, write_i and v_i; disk[d_j][p_k] denotes the block on disk d_j in which process p_k writes dblock[p_k]. We denote by read_d() (resp. write_d()) the operation of reading from (resp. writing to) a disk. As in [5], we assume that every disk ensures that (i) an operation write_d(k, ∗) cannot overwrite a value of an earlier round k' < k, (ii) a process must wait for acknowledgements when performing a write_d() operation, and (iii) write_d() and read_d() are atomic operations.

The round-based register protocol works as follows. For the read() operation, a process p_i first writes_d its dblock[p_i] on every disk d_j (∀ d_j: disk[d_j][p_i]). After writing, p_i reads_d, for every disk d_j and every process p_k, the block disk[d_j][p_k]. If p_i reads_d a block carrying a round higher than or equal to its own round, the read() operation aborts. Otherwise, the read() commits and returns the value associated with the highest write round. A similar scheme is used for the write() operation. Note that this round-based register implementation is simpler than the previous ones, due to the use of disks.

1: procedure register()    {Constructor, for each process p_i}
2:    {The operation write_d() stores the whole block onto the disk; for presentation clarity, we show as parameters only the values that are actually modified.}
3: procedure read(k)
4:    write_d(k)    {read_i = k} {Wait for a majority of disk blocks}
5:    read_d()
6:    if (received a block with read_j ≥ k or write_j ≥ k) then return(abort, init_i)
7:    choose v_max from the block with the highest write_j; return(commit, v_max)    {v_max = ⊥ if write_j = 0}
8: procedure write(k, v)
9:    write_d(k, v)    {write_i = k, v_i = v} {Wait for a majority of disk blocks}
10:   read_d()
11:   if (received a block with read_j > k or write_j > k) then return(abort, v) else return(commit, v)
12: upon recovery do
13:   read_d(); read_i ← MAX(read_received); write_i ← MAX(write_received)    {Read all blocks}
14:   v_i ← dblock[].v_{write_i}    {Take v from the block with the highest write_i}

Figure 18. A wait-free round-based register built on commodity disks

Lemma 43. Read-abort: If read(k) aborts, then some operation read(k') or write(k', ∗) was invoked with k' ≥ k.

Proof. Assume that some process p_j invokes a read(k) that returns abort (i.e., aborts). By the algorithm of Figure 18, this can only happen if some process p_i has a value read_i ≥ k or write_i ≥ k (line 6), which means that some process has invoked read(k') or write(k', ∗) with k' ≥ k.

Lemma 44. Write-abort: If write(k, ∗) aborts, then some operation read(k') or write(k', ∗) was invoked with k' > k.

Proof. Assume that some process p_j invokes a write(k, ∗) that returns abort (i.e., aborts). By the algorithm of Figure 18, this can only happen if some process p_i has a value read_i > k or write_i > k (line 11), which means that some process has invoked read(k') or write(k', ∗) with k' > k.



Lemma 45. Read-write-commit: If read(k) or write(k, ∗) commits, then no subsequent read(k') can commit with k' ≤ k and no subsequent write(k', ∗) can commit with k' < k.

Proof. Recall that we assume that a write_d(k', ∗) cannot overwrite_d a write_d(k, ∗) with k' < k. In the algorithm of Figure 18, p_i invokes write_d() in both procedures; therefore p_i cannot commit read(k') with k' ≤ k (line 6) or commit write(k', ∗) with k' < k (line 11).

Lemma 46. Read-commit: If read(k) commits with v and v ≠ ⊥, then some operation write(k', v) was invoked with k' < k.

Proof. By the algorithm of Figure 18, if some process p_j commits read(k) with v ≠ ⊥, then some process p_i must have written_d v to some disk, since v_i is only modified in the write() operation; otherwise v_max would be equal to ⊥.

Lemma 47. Write-commit: If write(k, v) commits and no subsequent write(k', v') is invoked with k' ≥ k and v' ≠ v, then any read(k') that commits, commits with v if k' > k.

Proof. Assume that some process p_i commits write(k, v), that no subsequent write(k', v') has been invoked with k' ≥ k and v' ≠ v, and that for some k' > k some process p_j commits read(k') with v'. Assume by contradiction that v ≠ v'. Since read(k') commits with v', by the read-commit property, some write(k'', v') was invoked at a round k'' ≤ k'. However, this is impossible since we assumed that no write operation with round at least k and value different from v has been invoked, i.e., v_i remains equal to v: a contradiction.

Proposition 48. The algorithm of Figure 18 implements a wait-free round-based register. Proof. Directly from lemmata 43, 44, 45, 46 and 47 and the fact that we assume a majority of correct disks.
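To illustrate the write-then-read-all pattern of Figure 18, the following sketch mimics the disk-based register on an array of in-memory "disks" (our own simplification; real Disk Paxos accesses remote commodity disks, and only a majority of them needs to respond, which the sketch glosses over). All class and field names are ours.

import java.util.*;

// In-memory sketch of the round-based register built on disks.
// Each "disk" holds one block per process: (read, write, value).
final class DiskRegisterSketch {
    static final class Block { int read, write; String value; }          // value == null models ⊥
    static final class ReadResult {
        final boolean commit; final String value;
        ReadResult(boolean commit, String value) { this.commit = commit; this.value = value; }
    }

    private final Block[][] disk;     // disk[d][p]: block that process p writes on disk d
    private final int me;

    DiskRegisterSketch(int disks, int processes, int me) {
        this.me = me;
        disk = new Block[disks][processes];
        for (Block[] d : disk) for (int p = 0; p < d.length; p++) d[p] = new Block();
    }

    // read(k): write our block with read = k on every disk, then read every block;
    // abort if another process's block already carries a round >= k.
    ReadResult read(int k) {
        for (Block[] d : disk) d[me].read = Math.max(d[me].read, k);      // write_d(k); disks never go backwards
        String best = null; int bestWrite = 0;
        for (Block[] d : disk) for (Block b : d) {
            if (b != d[me] && (b.read >= k || b.write >= k)) return new ReadResult(false, null);  // abort
            if (b.write > bestWrite) { bestWrite = b.write; best = b.value; }
        }
        return new ReadResult(true, best);                                // commit with the latest written value
    }

    // write(k, v): write our block with write = k and value v, then read every block;
    // abort if another process's block carries a strictly higher round.
    boolean write(int k, String v) {
        for (Block[] d : disk) { d[me].write = Math.max(d[me].write, k); d[me].value = v; }
        for (Block[] d : disk) for (Block b : d)
            if (b != d[me] && (b.read > k || b.write > k)) return false;  // abort
        return true;                                                      // commit
    }
}

The point of the sketch is that disks are passive: all coordination comes from the round numbers written into the blocks, which is why only a majority of correct disks (and a single live process) is needed for progress.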

6.4 Fall: Fast Paxos

In Paxos, when a process p_i TO-Broadcasts a message m, p_i sends m to the leader process p_l. When p_l receives m, p_l triggers a new round-based consensus instance by proposing a batch of messages. A round-based consensus is made up of two phases, a read phase and a write phase. The read phase figures out whether some value was already written, while the write phase either writes a new value (if the register contained ⊥) or rewrites the last written value. In the specific case of k = 1 (i.e., the first round), p_1 can safely invoke the write(1, ∗) operation without reading: indeed, if any other process has read or written any value, the write(1, ∗) invocation of p_1 aborts. In this case, consensus (if it commits) can be reached significantly faster than in a "regular" scenario. Interestingly, this optimisation can actually be applied whenever the system stabilises (even if processes do not know when that occurs). Indeed, the key idea behind the optimisation is that p_1 knows that writing directly at round 1 is safe because, in case of any other write, p_1's write would automatically be aborted. In fact, once a leader gets elected and commits a value, the leader can send a new message to all processes indicating that, for the subsequent consensus instances, only this process can try to directly write onto the register. This new message can be piggy-backed onto the messages of the write() primitive, thus avoiding any additional communication steps.



Moreover, the last decision is piggy-backed onto the next consensus invocation, thus saving one more communication step. Hence, the optimised protocol goes through two modes. Whenever a leader p_i commits consensus (in the initial, regular mode), it switches to the fast mode and tries to directly impose its value for the next consensus. If the system is stable, p_i succeeds and hence needs only one forced log and one communication round trip. We introduce a specific fastpropose() operation that invokes write() directly and ensures that only one process can invoke fastpropose() per consensus, i.e., per batch of messages (independently of the round number). A fastpropose() invokes write() with a round number between 1 and n, while for propose(), i.e., a regular write(), round numbers start at n+1. This way, a process can tell whether a write() originates from a propose() or from a fastpropose(). If the fastpropose() does not succeed, p_i goes back to the regular mode. We implement this mode switching by refining our round-based consensus and round-based register abstractions. We give here the intuition.

[Figure 19 (diagram): processes p_1 to p_5; a regular communication pattern for batch L (p_3 TO-Broadcasts m, the leader p_1 runs propose(1, m), the decision is sent to all) followed by a fast pattern for batch L+1 (p_5 TO-Broadcasts m', the leader runs fastpropose(1, m') and piggy-backs and imposes the previous decision).]

Figure 19. Communication steps for a regular followed by a fast communication pattern

Basically, we change the initialisations of our round-based consensus and round-based register abstractions. We use, in their constructors, a boolean variable fast that is set to true (resp. false) to distinguish the two cases. We add one specific operation fastpropose() to the interface of round-based consensus. Our modular Paxos protocol is also slightly modified to invoke the fastpropose() operation. Figure 19 depicts the different communication-step schemes; for clarity, we omit forced logs. Process p_1 executes a regular communication pattern for message m and then a fast communication pattern for the next consensus (message m'). First, p_3 elects p_1 and sends m to p_1. When p_1 commits consensus for batch L, with the permission to perform the next batch in fast mode, p_1 switches to the fast mode for batch L+1. When p_5 TO-Broadcasts m', p_5 elects p_1 and sends m' to p_1. Process p_1 then imposes the decision for batch L+1 and piggy-backs the last decision (for L) on the same consensus invocation (L+1). Batch L+1 of messages is decided, but will be TO-Delivered only with the next batch of messages (L+2).

Fast Round-Based Register. The fast round-based register has read() and write() operations similar to those of a regular round-based register. A variable permission is added to the values returned by the write() primitive: permission is set to true if the variable v of both the current and the next consensus is empty, otherwise it is set to false. The variable permission indicates to the upper layer that the process can directly invoke Fast Paxos for the next consensus.



If a process p_i receives a nackWRITE message, it returns (abort, false). If p_i gathers only ackWRITE messages, then it returns (commit, true) only if all received ackWRITE messages have permission set to true; otherwise p_i returns (commit, false). Note that, since v_i is modified and stored after permission is set, only one process can perform a Fast Paxos per consensus. The fast round-based register also has a different constructor, since it extracts (if there is any) the decision that is piggy-backed onto the invocation and simulates the reception of a DECIDE message. Note also that line 32 of Figure 20 prevents the violation of the agreement property.¹⁵

¹⁵ Variable write_i is set to a value between n and n+1. If it were set to n+1, the invocation of write(n+1) would abort and hence require an additional round. If write_i were set to n, the agreement property could be violated, since two fast writes could then occur, e.g., write(1) and write(n).

1: procedure register()    {Constructor, for each process p_i}
2:    read_i ← 0
3:    write_i ← 0
4:    v_i ← ⊥
5:    if any, extract msgSet and Kpj and simulate the receipt of a message (DECIDE, Kpj, msgSet)    {Added from Figure 12}
6:    permission ← false    {Added from Figure 12}
7: procedure read(k)
8:    s-send [READ, k] to all processes
9:    wait until received [ackREAD, k, *, *] or [nackREAD, k] from ⌈(n+1)/2⌉ processes
10:   if received at least one [nackREAD, k] then
11:      return(abort, v)
12:   else
13:      select the [ackREAD, k, k', v] with the highest k'
14:      return(commit, v)
15: procedure write(k, v)
16:   s-send [WRITE, k, v] to all processes
17:   wait until received [ackWRITE, k, *] or [nackWRITE, k] from ⌈(n+1)/2⌉ processes
18:   if received at least one [nackWRITE, k] then
19:      return(abort, false)
20:   else
21:      if received at least one [ackWRITE, k, false] then return(commit, false) else return(commit, true)    {Modified from Figure 12}
22: task wait until receive [READ, k] from p_j
23:   if write_i ≥ k or read_i ≥ k then
24:      s-send [nackREAD, k] to p_j
25:   else
26:      read_i ← k; store{read_i}
27:      s-send [ackREAD, k, write_i, v_i] to p_j
28: task wait until receive [WRITE, k, v] from p_j
29:   if write_i > k or read_i > k then
30:      s-send [nackWRITE, k] to p_j
31:   else
32:      if k ≤ n then write_i ← n+½ else write_i ← k    {Modified from Figure 12}
33:      permission ← ((v_i = ⊥) and (v_{i+1} = ⊥))
34:      v_i ← v; store{write_i, v_i}
35:      s-send [ackWRITE, k, permission] to p_j
36: upon recovery do
37:   initialisation
38:   retrieve{write_i, read_i, v_i}

Figure 20. Wait-free fast round-based register

Fast Round-Based Consensus. Fast round-based consensus has a parameterised constructor: fast indicates whether the mode is fast or not, and the new constructor instantiates a new register using the fast parameter. Fast round-based consensus exports the primitive propose() of a regular round-based consensus (augmented with the return value nextFast). The variable nextFast is a boolean that indicates whether the next batch of messages can be executed in a fast manner. Its value is set to the return value of the fast round-based register (permission). Moreover, nextFast is set in such a way that, for a particular batch L, it returns true only once, independently of the number of invocations of propose() or fastpropose(). A process p_i can perform Fast Paxos for batch L+1 only if p_i commits consensus (either by propose() or fastpropose()) for batch L with nextFast set to true. The fast round-based consensus also exports a new primitive, fastpropose(), described next.



The fastpropose() primitive takes as input an integer and an initial value v (i.e., a proposition for the fast consensus). It returns a status in {commit, abort}, a value v' and a boolean value nextFast. The fastpropose() primitive is a propose() primitive that satisfies the validity and agreement properties of the regular propose() primitive, plus the following Fast Termination property, provided fastpropose() is invoked only with round numbers 1 ≤ k ≤ n:

• Fast Termination: If some operation fastpropose(∗, ∗) aborts, then some operation fastpropose(−, −) was invoked; if fastpropose(∗, ∗) commits, then no different operation fastpropose(−, −) can commit.

In fact, the fastpropose() primitive is straightforward to implement, since it only invokes the write() primitive of the fast round-based register with a round number between 1 and n.
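To make the round-number convention concrete, the sketch below (our own helper, not part of the paper's interface) shows one way to partition rounds so that fast writes use rounds 1..n while regular proposals of process p_i use rounds n+i, 2n+i, and so on. A register receiving a WRITE can then tell a fast write from a regular one simply by testing k ≤ n.

// Sketch of the round-number partitioning assumed by Fast Paxos:
// rounds 1..n are reserved for fastpropose(), and process i (1-based) uses
// rounds n+i, 2n+i, 3n+i, ... for regular propose(), so rounds of distinct
// processes never collide and k <= n identifies a fast write.
final class RoundNumbering {
    private final int n;          // number of processes
    private final int i;          // identity of this process, 1 <= i <= n

    RoundNumbering(int n, int i) { this.n = n; this.i = i; }

    int fastRound() { return i; }                       // the fast round this process may use

    int firstRegularRound() { return n + i; }           // regular rounds start above n

    int nextRegularRound(int previous) { return previous + n; }   // l <- l + n, as in the propose task

    static boolean isFastWrite(int k, int n) { return k <= n; }   // how the register tells them apart
}

Keeping the two ranges disjoint is what allows the register handler to set write_i to n+½ on a fast write: any later fast write (k ≤ n) is nacked, while regular writes (k ≥ n+1) are still possible.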

1: procedure consensus(fast)    {Constructor, for each process p_i, modified from Figure 6}
2:    v ← ⊥; reg ← new register(); writeRes ← abort; nextFast ← false    {Initialisation, modified from Figure 6}
3: procedure propose(k, init_i)
4:    if reg.read(k) = (commit, v) then
5:       if (v = ⊥) then v ← init_i
6:       (writeRes, nextFast) ← reg.write(k, v)
7:       if writeRes = commit then return(commit, v, nextFast) else return(abort, init_i, nextFast)
8:    return(abort, init_i, false)
9: procedure fastpropose(k, init_i)    {Added from Figure 6}
10:   (writeRes, nextFast) ← reg.write(k, init_i)
11:   if writeRes = commit then return(commit, init_i, nextFast) else return(abort, init_i, nextFast)

Figure 21. Wait-free fast round-based consensus

Lemma 49. Fast Termination: If some operation fastpropose(∗, ∗) aborts, then some operation fastpropose(−, −) was invoked; if fastpropose(∗, ∗) commits, then no different operation fastpropose(−, −) can commit.

Proof. We assume here that processes invoke fastpropose() only with round numbers 1 ≤ k ≤ n. There are two cases to consider: (i) two different processes invoke fastpropose() for the same consensus, or (ii) a process invokes fastpropose() twice for the same consensus. Consider case (i), and assume by contradiction that two different processes p_i and p_j invoke fastpropose(). Assume moreover that p_i returns from fastpropose(): by line 32 of Figure 20, when p_j tries to invoke fastpropose(), p_j cannot succeed since the variable write is already set to n+½ at the processes that acknowledged p_i: a contradiction. Now consider case (ii), and assume that p_i invokes fastpropose() twice for the same consensus number: since write_i is stored, p_i cannot commit fastpropose() twice with nextFast set to true: a contradiction. ✷

Proposition 50. If fastpropose() is invoked only once per consensus, then the algorithm of Figure 21 implements a wait-free fast round-based consensus in a crash-recovery model.

Proof (sketch). The proof is based on Lemma 49 and the fact that the proofs of the validity and agreement properties are similar to the proofs of lemmata 8 and 9. ✷


Fast Paxos. Intuitively, once a process p_i returns from propose() or fastpropose() with nextFast set to true for batch L, this process has the permission to execute a fast consensus, i.e., to invoke fastpropose() for batch L+1. We slightly modify the Paxos algorithm by adding an array fast[] that is initially set to false.



When a process p_i decides for batch L (in the regular mode), p_i sends the decision to every process and sets the variable fast[L+1] to true if fastpropose() or propose() returned with nextFast set to true (switching from the regular to the fast mode for the next consensus). The next time p_i invokes a new consensus (fast[L] is true), p_i (i) piggy-backs the last decision (if there is any) onto the new instantiation of consensus, and (ii) invokes fastpropose(). This invocation has a different impact on the round-based register, as explained earlier. When p_i commits fastpropose(), p_i (a) does not need to send the decision to every process, since the decision is piggy-backed onto the next consensus invocation, and (b) sets fast for the next consensus to true, so that p_i can again perform a Fast Paxos. When p_i aborts fastpropose(), p_i sets fast back to false, since p_i cannot force the decision for this consensus, i.e., the communication pattern becomes regular again. Note that, in the fast mode, it is necessary that the last decision (if there is any) be piggy-backed onto the invocation of the constructor of our round-based register; otherwise, the process that creates the round-based register would not be able to TO-Deliver the last decision.

Since there can be concurrent executions of consensus, when a process commits a regular consensus for batch L, the next fast consensus will not always be batch L+1. Consider the following example: if a process p_i starts three consensus instances for batch numbers L = 1, 2 and 3, then when p_i commits batch number L = 1, p_i sets fast to true for batch number 2 and not for 4 (only the batch number immediately following L is set to true, not the last batch number started). Note also that the last decision piggy-backed is TO_delivered[L-1], but it can be empty; in that case, the last decision piggy-backed is the latest decision that p_i has, e.g., AwaitingToBeDelivered[latestDecisionReceived] or TO_delivered[latestTODelivered]. Note that we assume here that lines 24 and 25 are executed atomically.

Lemma 51. There can be only one invocation of fastpropose() per consensus.

Proof. By the algorithm of Figure 22, processes invoke fastpropose() only with round numbers 1 ≤ k ≤ n. There are two cases to consider: (i) two different processes invoke fastpropose() for the same consensus, or (ii) a process invokes fastpropose() twice for the same consensus. Consider case (i), and assume by contradiction that two different processes p_i and p_j invoke fastpropose() for consensus number L+1. For both processes, invoking fastpropose() for consensus L+1 requires fast[L+1] to be set to true, which requires the process to perform a successful propose() (or fastpropose()) that returns nextFast as true for consensus L. Assume that p_i returns from propose() (or fastpropose()) with nextFast set to true: a majority of processes have returned with permission set to true (hence v_L ≠ ⊥ at a majority of processes) and no process has returned with permission set to false. When p_j then invokes propose() or fastpropose(), by the algorithm of Figure 20, p_j has to return with nextFast set to false, since two majorities always intersect: a contradiction. Now consider case (ii). Assume that p_i invokes fastpropose() twice for the same consensus number L+1. By the algorithm of Figure 22, p_i must have crashed and recovered between the two invocations of fastpropose(). When p_i recovers, fast[L+1] is reset to false (initialisation). To invoke fastpropose() after having recovered, p_i has to perform a successful propose() (or fastpropose()) with nextFast set to true for consensus L.
This is impossible because a majority of processes already have v_L ≠ ⊥: a contradiction. ✷

Proposition 52. With a wait-free round-based consensus and a wait-free weak leader election, the algorithm of Figure 22 ensures the termination, agreement, validity and total order properties in a crash-recovery model.

Lemma 53. Termination: If a process p_i TO-Broadcasts a message m and then p_i does not crash, then p_i eventually TO-Delivers m.

Lemma 54. Agreement: If a process TO-Delivers a message m, then every correct process eventually TO-Delivers m.



1: For each process p_i:
2: procedure initialisation:
3:    Received[] ← ∅; TO_delivered[] ← ∅; fast[] ← {false, ...}    {Modified from Figure 13}
4:    TO_undelivered ← ∅; AwaitingToBeDelivered[] ← ∅; K ← 1; nextBatch ← 1; start task{launch}
5: procedure TO-Broadcast(m)
6:    Received ← Received ∪ m
7: procedure deliver(msgSet)
8:    TO_delivered[nextBatch] ← msgSet − TO_delivered
9:    atomically deliver all messages in TO_delivered[nextBatch] in some deterministic order
10:   store{TO_delivered, nextBatch}
11:   nextBatch ← nextBatch+1    {Stop retransmission module ∀ messages of nextBatch-1 except DECIDE or UPDATE}
12:   while AwaitingToBeDelivered[nextBatch] ≠ ∅ do
13:      TO_delivered[nextBatch] ← AwaitingToBeDelivered[nextBatch] − TO_delivered; atomically deliver TO_delivered[nextBatch]
14:      store{TO_delivered, nextBatch}
15:      nextBatch ← nextBatch+1    {Stop retransmission module ∀ messages of nextBatch-1 except DECIDE or UPDATE}
16: task launch    {Upon case executed only once per received message}
17:   upon Received − TO_delivered ≠ ∅ or leader has changed do    {If upon triggered by a leader change, jump to line 28}
18:      while AwaitingToBeDelivered[K+1] ≠ ∅ or TO_delivered[K+1] ≠ ∅ do
19:         K ← K+1
20:      if K = nextBatch and AwaitingToBeDelivered[K] ≠ ∅ and TO_delivered[K] = ∅ then
21:         deliver(AwaitingToBeDelivered[K])
22:      TO_undelivered ← Received − TO_delivered
23:      if leader() = p_i then
24:         while propose_K is active do
25:            K ← K+1
26:         start task propose_K(K, i, TO_undelivered); K ← K+1
27:      else
28:         s-send(TO_undelivered) to leader()
29: task propose(L, l, msgSet)    {Modified from Figure 13}
30:   committed ← false
31:   if fast[L] then    {Added from Figure 13}
32:      piggy-back TO_delivered[L-1] (if not empty, otherwise the latest decision) onto the next instantiation and invocation of consensus
33:      consensus_L ← new consensus(true)
34:      if consensus_L.fastpropose(l, msgSet) = (commit, returnedMsgSet, nextFast) then
35:         if L = nextBatch then deliver(returnedMsgSet) else AwaitingToBeDelivered[L] ← returnedMsgSet; committed ← true
36:      fast[L] ← false; fast[L+1] ← nextFast
37:   if consensus_L = ⊥ then consensus_L ← new consensus(false)
38:   while not committed do
39:      l ← l+n
40:      if leader() = p_i then
41:         if consensus_L.propose(l, msgSet) = (commit, returnedMsgSet, nextFast) then
42:            committed ← true; s-send(DECISION, L, returnedMsgSet) to all processes; fast[L+1] ← nextFast
43:         else
44:            fast[L+1] ← false
45: upon s-receive m from p_j do
46:   if m = (DECISION, nextBatch, msgSet^{Kpj}) or m = (UPDATE, Kpj, TO_delivered[Kpj]) then
47:      if task propose_{Kpj} is active then stop task propose_{Kpj}
48:      if Kpj ≠ nextBatch then    {p_j is ahead or behind}
49:         if Kpj < nextBatch then    {p_j is behind}
50:            for all L such that Kpj < L < nextBatch: s-send(UPDATE, L, TO_delivered[L]) to p_j    {If p_j ≠ p_i}
51:         else
52:            AwaitingToBeDelivered[Kpj] ← msgSet^{Kpj}; s-send(UPDATE, nextBatch-1, TO_delivered[nextBatch-1]) to p_j    {If p_j ≠ p_i}
53:      else
54:         deliver(msgSet^{Kpj})
55:   else
56:      Received ← Received ∪ msgSet^{TO_undelivered}    {Consensus messages are treated in the consensus box}
57: upon recovery do
58:   initialisation
59:   retrieve{TO_delivered, nextBatch}; K ← nextBatch; nextBatch ← nextBatch+1; Received ← TO_delivered

Figure 22. Fast Paxos in a crash-recovery model



Lemma 55. Validity: For any message m, (i) every process p_i that TO-Delivers m, TO-Delivers m only if m was previously TO-Broadcast by some process, and (ii) every process p_i TO-Delivers m at most once.

Lemma 56. Total order: Let p_i and p_j be any two processes that TO-Deliver some message m. If p_i TO-Delivers some message m' before m, then p_j also TO-Delivers m' before m.

By Lemma 51, the proofs for lemmata 53 through 56 are identical to those of lemmata 27 through 30, since (a) the properties of the fastpropose() primitive are more restrictive than those of the propose() primitive, and (b) the properties of the regular propose() remain the same.
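To summarise the mode switch of Figure 22 in compact form, the sketch below (our own naming; messaging, leadership checks and the DECISION broadcast are deliberately omitted) shows how the nextFast flag drives the choice between the fast path and the regular read+write path, and how losing a fast attempt falls back to regular mode.

import java.util.*;

// Sketch of the fast/regular mode switch driven by nextFast.
final class ModeSwitchSketch {
    interface Consensus {                      // stand-in for (fast) round-based consensus
        Result fastpropose(int round, String batch);
        Result propose(int round, String batch);
    }
    static final class Result {
        final boolean commit; final String decided; final boolean nextFast;
        Result(boolean commit, String decided, boolean nextFast) {
            this.commit = commit; this.decided = decided; this.nextFast = nextFast;
        }
    }

    private final Set<Integer> fastBatches = new HashSet<>();   // batches that may be proposed fast
    private final int n, i;                                     // #processes and our identity (1-based)

    ModeSwitchSketch(int n, int i) { this.n = n; this.i = i; }

    String proposeBatch(int L, String batch, Consensus consensus) {
        if (fastBatches.remove(L)) {                  // fast mode: skip the read phase, round i <= n
            Result r = consensus.fastpropose(i, batch);
            if (r.commit) { if (r.nextFast) fastBatches.add(L + 1); return r.decided; }
        }
        int l = i;                                    // regular mode: rounds n+i, 2n+i, ... (l <- l+n)
        while (true) {
            l += n;
            Result r = consensus.propose(l, batch);
            if (r.commit) { if (r.nextFast) fastBatches.add(L + 1); return r.decided; }
            fastBatches.remove(L + 1);                // cannot force this consensus: back to regular mode
        }
    }
}

In stable periods only the fast branch runs, which is exactly where the single-forced-log, single-round-trip cost claimed for Fast Paxos comes from.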

7 Related Work

The contribution of this paper is a faithful deconstruction of the Paxos replication algorithm. Our deconstruction is faithful in the sense that it preserves the efficiency of the original Paxos algorithm. This promotes the implementation of the algorithm in a modular manner, and the reconstruction of variants of it that are customised for specific environments.

In [12, 16], the authors focused on the consensus part of Paxos with the aim of either explaining the algorithm and emphasising its importance [12] or proving its correctness [16]. In [12, 16], the authors discussed how a state machine replication algorithm can be constructed as a sequence of consensus instances. As they pointed out, however, that might not be the most efficient way to obtain a replication scheme. Indeed, compared to the original Paxos protocol, additional messages and forced logs are required when relying on a consensus box. This is in particular because the very nature of traditional consensus requires every process to start consensus, i.e., it adds messages compared to Paxos, and, in a crash-recovery model, every process needs to log its initial value. Considering a finer-grained, round-based consensus abstraction, separated from a leader election abstraction, is the key to our faithful deconstruction of the Paxos replication algorithm. Our round-based consensus allows a process to propose more than once without implying a forced log, and allows us to merge all logs at the lowest abstraction level while exporting the round number up to the total order broadcast layer.

Our round-based consensus abstraction is somewhat similar to the "weak" consensus abstraction identified by Lampson in [12]. There are two fundamental differences. "Weak" consensus does not ensure any liveness property. As stated by Lampson, the reason for not giving any liveness property is to avoid the applicability of the impossibility result of [4]. Our round-based consensus specification is weaker than consensus and does not fall under the impossibility result of [4], but nevertheless includes a liveness property. The termination property of our round-based consensus, coupled with our leader election property, is precisely what allows us to ensure progress at the level of total order broadcast.

In [5], a variant of Paxos, called Disk Paxos, decouples processes and stable storage. A crash-recovery model is assumed, and progress requires only one process to be up and a majority of functioning disks. Thanks again to our modular approach, we implement Disk Paxos by only modifying the implementation of our round-based register. The algorithm of Section 6.3 is faithful to Disk Paxos in that both have the same number of forced logs, messages and communication steps.¹⁶

bal, mbal and inp in [5] correspond to writei , readi and vi in our case, while a ballot number in [5] corresponds to a round number



Note that our leader election implementation that copes with unstable processes can be used with Disk Paxos to improve its resilience. Independently of Paxos, [15] presented a replication protocol that also ensures fast progress in stable periods of the system: our Fast Paxos variant can be viewed as a modular version of that protocol.

In [13], a new failure detector, ✸C, is introduced. This failure detector, which is shown to be equivalent to Ω, adds to the failure detection capability of ✸S [3] an eventual leader election flavour. Informally, this flavour allows every correct process to eventually choose the same correct process as leader and eventually ensure fast progress. We have shown that Ω can be directly used for that purpose, and we have done so in a more general crash-recovery model. Finally, [17] gave a total order broadcast algorithm in a crash-recovery model based on a consensus box [3]. As we pointed out, by using consensus as a black box, all processes need to propose an initial value which, in a crash-recovery model, means that they all need a specific forced log for that purpose (this issue was also pointed out in [17]). Precisely because of our round-based consensus abstraction, we are able to alleviate the need for this forced log.

Acknowledgements. We are very grateful to Marcos Aguilera and Sam Toueg for their helpful comments on the specification of our register abstraction. We would also like to thank the anonymous reviewers for their very helpful comments.

References

[1] M. K. Aguilera, W. Chen, and S. Toueg. Failure detection and consensus in the crash-recovery model. Distributed Computing, 13(2):99–125, May 2000.
[2] T. D. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, July 1996.
[3] T. D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, 1996.
[4] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.
[5] E. Gafni and L. Lamport. Disk Paxos. In Proceedings of the 14th International Symposium on Distributed Computing (DISC'00), Lecture Notes in Computer Science, Toledo, Spain, October 2000. Springer-Verlag.
[6] R. Guerraoui. Indulgent algorithms. In Proceedings of the 19th ACM Symposium on Principles of Distributed Computing (PODC'00), pages 289–298, Portland, OR, USA, July 2000.
[7] R. Guerraoui and A. Schiper. Gamma-accurate failure detectors. In Proceedings of the 10th International Workshop on Distributed Algorithms (WDAG'96), number 1151 in Lecture Notes in Computer Science, Bologna, Italy, October 1996. Springer-Verlag.
[8] V. Hadzilacos and S. Toueg. Fault-tolerant broadcasts and related problems. In S. Mullender, editor, Distributed Systems, ACM Press Books, chapter 5, pages 97–146. Addison-Wesley, second edition, 1993.
[9] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124–149, January 1991.
[10] M. Herlihy and J. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July 1990.
[11] L. Lamport. The part-time parliament. Technical Report 49, Systems Research Center, Digital Equipment Corp., Palo Alto, September 1989. A revised version also appeared in ACM Transactions on Computer Systems, 16(2).
[12] B. Lampson. How to build a highly available system using consensus. In Proceedings of the 10th International Workshop on Distributed Algorithms (WDAG'96), pages 1–15, Bologna, Italy, 1996.



[13] M. Larrea, A. Fernandez, and S. Arévalo. Eventually consistent failure detectors (short abstract). In Proceedings of the 14th International Symposium on Distributed Computing (DISC'00), Lecture Notes in Computer Science, Toledo, Spain, October 2000. Springer-Verlag.
[14] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[15] B. Oki and B. Liskov. Viewstamped replication: A general primary copy method to support highly available distributed systems. In Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (PODC'88), pages 8–17, Toronto, Ontario, Canada, August 1988.
[16] R. De Prisco, B. Lampson, and N. Lynch. Revisiting the Paxos algorithm. Theoretical Computer Science, 243(1-2):35–91, July 2000.
[17] L. Rodrigues and M. Raynal. Atomic broadcast in asynchronous systems where processes can crash and recover. In Proceedings of the 20th IEEE International Conference on Distributed Computing Systems (ICDCS'00), pages 288–295, Taipei, Taiwan, April 2000.
[18] L. Sabel and K. Marzullo. Election vs. consensus in asynchronous systems. Technical Report TR 95-1488, Computer Science Department, Cornell University, February 1995.
[19] F. Schneider. Replication management using the state-machine approach. In S. Mullender, editor, Distributed Systems, ACM Press Books, chapter 7, pages 169–198. Addison-Wesley, second edition, 1993.



A Optional Appendix. Performance Measurements

We have implemented our abstractions on a network of Java machines as a library of distributed shared objects. We give here some performance measurements of our modular Paxos implementation in different configurations. These measurements were made on a LAN interconnected by Fast Ethernet (100 Mb/s) on a normal working day. The LAN consisted of 60 UltraSUN 10 (256 MB RAM, 9 GB hard disk) machines. All stations were running Solaris 2.7, and our implementation was running on the Solaris Java HotSpot Client VM (build 1.3.0_01, mixed mode). The effective message size was 1 KB, and the performance tests consider only cases where as many broadcasts as possible are executed. In all tests, we considered stable periods where process p_0 was the leader and one process was running per machine.

[Figure 23: two plots of # of TO_Deliver/sec versus # of processes; (a) Fast Paxos vs Regular Paxos, (b) varying the set of broadcasters (all processes, all except p0, only p1, only p0) for Fast Paxos.]

Figure 23. Broadcast performance

Figure 23(a) depicts the throughput difference between Regular Paxos and Fast Paxos. Not surprisingly, Fast Paxos has a higher throughput. The overall performance of both algorithms decreases as the number of processes grows, since the leader has to send messages to, and receive messages from, an increasing number of processes.

Figure 23(b) depicts the performance of Fast Paxos when the number of broadcasting processes increases. We considered four cases: (i) only the leader broadcasts, (ii) one process other than the leader broadcasts, (iii) all processes except the leader broadcast, and (iv) all processes broadcast. Distributing the load of the broadcasting processes over a larger number of processes improves the average throughput. As expected, the throughput is lower when the leader is the unique broadcasting process, since it is the most overloaded. Case (iii) has a better throughput than case (iv) beyond 12 processes, since the leader does not broadcast and can devote more processing power to the protocol than in case (iv). This shows that broadcasting messages slows down a process, which is also confirmed by the increased throughput when a process other than the leader (case ii) is broadcasting. (When the number of processes increases, the throughputs come close to each other because the capacity of Paxos is reached.)

Figure 24 compares Fast Paxos in two different modes: (i) concurrent consensus instances are started, and (ii) only consecutive consensus instances are launched. So as not to overwhelm each process with context switching, Paxos is implemented using a thread pool that is limited to ten, i.e., at most ten concurrent consensus instances run at each process.




[Figure 24: four plots of # of TO_Deliver/sec versus total # of TO_Broadcast, comparing concurrent and consecutive consensus executions for (a) 3, (b) 6, (c) 10 and (d) 20 processes.]

Figure 24. Concurrent vs consecutive (Fast Paxos)


The throughput in both modes decreases as the number of protocol instances increases. At first, the concurrent version gives better performance, but the advantage diminishes as the number of broadcasts increases: the additional computation needed to launch tasks impedes the concurrent version, and its performance degrades. The results also show that the more processes the system has, the smaller the difference in throughput between consecutive and concurrent executions, i.e., with more processes in the system, fewer consensus instances are launched. Figure 25 depicts the broadcast rate at which the best throughput is achieved, for 4 to 10 processes. In all cases, the throughput increases (approximately) linearly up to a certain point (e.g., up to 10 broadcasts/sec/process for a six-process system) and then falls off sharply. Above this breakpoint, the leader again becomes the bottleneck: its receive task is overwhelmed by the number of broadcasts it has to handle, which delays new protocol instances.

[Figure 25. Best throughput (Fast Paxos): # of TO_Deliver/sec vs TO_Broadcast/sec/process, for (a) 4, (b) 6, and (c) 10 processes.]

Figure 26(a) depicts the impact of forced logs on the Fast Paxos algorithm. When forced logs are removed, the performance gain is minimal, because the algorithm is fine-tuned and waits for a certain number of broadcast messages before launching a consensus. The TO-Deliver rate is far better when a consensus is launched for a batch of messages rather than starting a consensus for each single broadcast message; otherwise the number of consensus instances becomes too large and slows down the algorithm. Owing to this optimisation, there are few consensus instances per second and hence few stable-storage accesses per second. Therefore, removing stable storage does not improve performance as drastically as one might think. This result suggests that the variant of the protocol that avoids stable storage is not really useful for a practical system.18 However, Figure 26(b) shows that forced logs can have a significant impact: if Fast Paxos launches a large number of consensus instances per second, i.e., a consensus is started consecutively for each single broadcast message (no other consensus instances run in parallel, but there can be many instances per second), then the cost of forced logs is quite significant, as shown in Figure 26(b).
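The batching optimisation described above can be pictured with the following minimal Java sketch, assuming hypothetical forceLog and launchConsensus hooks rather than the actual implementation: one forced log and one consensus instance are issued per batch of broadcast messages instead of per message.

import java.util.ArrayList;
import java.util.List;

// Sketch of the batching optimisation: a consensus instance (and the forced log that
// goes with it) is launched per batch of broadcast messages rather than per message,
// so stable-storage accesses stay rare. The hooks below are placeholders.
public class BatchingBroadcaster {
    private static final int BATCH_SIZE = 32;          // assumed tuning knob
    private final List<byte[]> pending = new ArrayList<>();

    public synchronized void toBroadcast(byte[] msg) {
        pending.add(msg);
        if (pending.size() >= BATCH_SIZE) {
            List<byte[]> batch = new ArrayList<>(pending);
            pending.clear();
            forceLog(batch);                            // one stable-storage write per batch
            launchConsensus(batch);                     // one consensus instance per batch
        }
    }

    private void forceLog(List<byte[]> batch) { /* fsync the batch to stable storage */ }
    private void launchConsensus(List<byte[]> batch) { /* propose the batch as one value */ }
}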

[Figure 26. Comparison between forced logs and no stable storage (Fast Paxos): # of TO_Deliver/sec vs # of processes, with and without stable storage; (a) concurrent execution, (b) consecutive execution.]

Finally, Figure 27 gives the recovery time required by a process as a function of the number of messages retrieved from stable storage. The number of retrieved messages is proportional to the number of disk reads, and the recovery time grows accordingly.

[Figure 27. Recovery time: time for recovery (ms) vs # of messages retrieved from stable storage.]

18 Moreover, note that for a long-lived application this model is not really practical, since every process is likely to crash and recover at least once during the life of the application.



Appendix B (Optional). Implementation of Ω in a Crash-Recovery Model with Partial Synchrony

Figure 28 gives the implementation of the failure detector Ω in a crash-recovery model with partial synchrony assumptions. We assume that message communication times are bounded, but that the (unknown) bound holds only after some global stabilisation time. Intuitively, the algorithm works as follows. A process pi keeps track of the processes that it trusts in a set denoted trustlist, and keeps on sending I-AM-ALIVE messages to every process. Periodically, pi removes from its trustlist the processes from which it did not receive any I-AM-ALIVE message within a certain time-out. When pi receives an I-AM-ALIVE message from some process pj that is not part of its trustlist, pi adds pj back to its trustlist and increments pj's time-out. However, an unstable process can also be trusted this way, so the algorithm additionally counts the number of times each process crashes and recovers: an unstable process has an unbounded epoch number at a correct process, while a correct process has an epoch number that eventually stops increasing. When pi crashes and recovers, it sends a RECOVERED message to every process (line 8). When pj receives a RECOVERED message from pi, it increments the epoch number of pi and adds pi to its trustlist (line 23). The output Ω.trustlist contains the process in the trustlist with the lowest epoch number (line 15); if several such processes exist, the one with the lowest id is selected. Processes exchange their epoch numbers and take the maximum of all epoch numbers to prevent the following situation. Assume that processes p2, p3, p4 never crash and that process p1 crashes and recovers. When p1 recovers, assume that every process except p1 receives the RECOVERED message from p1. Then p1 has epoch_p1 = [0, 0, 0, 0], while the other processes have epoch_p2,p3,p4 = [1, 0, 0, 0]. Every process has the same trustlist, yet Ω at p1 outputs p1 while Ω at p2, p3, p4 outputs p2, which violates the property of Ω. By exchanging epoch numbers and taking the maximum, this case is avoided: when receiving an epoch vector from pj, pi takes, entry by entry, the maximum of its own epoch numbers and those received from pj (line 21). Note that the MIN function returns the first index that realises the minimum.

Proposition 62. The algorithm of Figure 28 satisfies the following property in a crash-recovery model with partial synchrony assumptions: there is a time after which exactly one correct process is always trusted by every correct process.

Proof. There is a time after which every correct process stops crashing and remains always-up. From then on, every correct process keeps on sending I-AM-ALIVE messages to every process. By the partial synchrony assumptions, after some global stabilisation time a message does not take longer than a certain period of time to go from one process to another, and every process eventually guesses this period by incrementing ∆pi at line 19. By the fair-loss property of the links, every correct process then receives I-AM-ALIVE messages infinitely often from every correct process. Therefore, every correct process eventually has the same trustlist and the same epoch vector, so they all output the same process. Eventually, this process is correct, since the algorithm chooses the process with the lowest epoch number (recall that an unstable process has an ever-increasing epoch number at every correct process, whereas a correct process's epoch number stops increasing).



1:  for each process pi:
2:    upon initialisation or recovery do
3:      Ω.trustlist ← ⊥; trustlist_pi ← Π
4:      for all pj ∈ Π do
5:        ∆_pi[pj] ← default time-out interval
6:        epoch_pi[pj] ← 0
7:      start task{updateD}
8:      if recovery then send(RECOVERED) to all
9:    task updateD
10:     repeat periodically
11:       send (I-AM-ALIVE, epoch_pi) to all processes
12:       for all pj ∈ Π do
13:         if pj ∈ trustlist_pi and pi did not receive I-AM-ALIVE from pj during the last ∆_pi[pj] then
14:           trustlist_pi ← trustlist_pi \ {pj}
15:       Ω.trustlist ← MIN(pk ∈ trustlist_pi | epoch_pi[pk] = MIN(epoch_pi))
16:    upon receive m from pj do
17:      if m = (I-AM-ALIVE, epoch_pj) then
18:        if pj ∉ trustlist_pi then
19:          trustlist_pi ← trustlist_pi ∪ {pj}; ∆_pi[pj] ← ∆_pi[pj] + 1
20:        for all pk ∈ Π do
21:          epoch_pi[pk] ← MAX(epoch_pj[pk], epoch_pi[pk])
22:      else if m = RECOVERED then
23:        epoch_pi[pj] ← epoch_pi[pj] + 1; trustlist_pi ← trustlist_pi ∪ {pj}

Figure 28. Implementing Ω in a crash-recovery model with partial synchrony assumptions
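To make the selection rule of line 15 concrete, the following is a minimal Java sketch of the bookkeeping a process could keep; the class and method names are illustrative assumptions, and the periodic time-out scan of lines 10-14 is reduced to a suspect() call.

import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical bookkeeping for one process: a trustlist plus an epoch vector.
public class OmegaState {
    private final Set<Integer> trustlist = new TreeSet<>();      // ids of trusted processes
    private final Map<Integer, Integer> epoch = new HashMap<>(); // epoch number per process

    public OmegaState(int n) {                                   // initially trust everyone (line 3)
        for (int p = 0; p < n; p++) { trustlist.add(p); epoch.put(p, 0); }
    }

    public void suspect(int pj) {                                // time-out expired (lines 13-14)
        trustlist.remove(pj);
    }

    public void onRecovered(int pj) {                            // RECOVERED from pj (lines 22-23)
        epoch.merge(pj, 1, Integer::sum);
        trustlist.add(pj);
    }

    public void onAlive(int pj, Map<Integer, Integer> epochPj) { // I-AM-ALIVE from pj (lines 17-21)
        trustlist.add(pj);
        epochPj.forEach((pk, e) -> epoch.merge(pk, e, Integer::max));
    }

    // Line 15: the trusted process with the lowest epoch number, lowest id breaking ties.
    public OptionalInt leader() {
        Integer best = null;
        int bestEpoch = Integer.MAX_VALUE;
        for (int p : trustlist) {                                // TreeSet iterates in id order
            int e = epoch.getOrDefault(p, 0);
            if (e < bestEpoch) { best = p; bestEpoch = e; }
        }
        return best == null ? OptionalInt.empty() : OptionalInt.of(best);
    }
}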



COVER FEATURE

CAP and Cloud Data Management
Raghu Ramakrishnan, Yahoo

Novel systems that scale out on demand, relying on replicated data and massively distributed architectures with clusters of thousands of machines, particularly those designed for real-time data serving and update workloads, amply illustrate the realities of the CAP theorem.

The relative simplicity of common requests in Web data management applications has led to data-serving systems that trade off some of the query and transaction functionality found in traditional database systems to efficiently support such features as scalability, elasticity, and high availability. The perspective described here is informed by my experience with Yahoo's PNUTS (Platform for Nimble Universal Table Storage) data-serving platform, which has been in use since 2008.1 As of 2011, PNUTS hosted more than 100 applications that support major Yahoo properties running on thousands of servers spread over 18 datacenters worldwide, with adoption and usage growing rapidly.2 The PNUTS design was shaped by the reality of georeplication—accessing a copy across a continent is much slower than accessing it locally—and we had to face the tradeoff between availability and consistent data access in the presence of partitions. It is worth noting, however, that the realities of slow access lead programmers to favor local copies even when there are no partitions. Thus, while the CAP theorem limits the consistency guarantees programmers can offer during partitions, they often make do with weaker guarantees even during normal operation, especially on reads.


BACKGROUND: ACID AND CONSISTENCY

Database systems support the concept of a transaction, which is informally an execution of a program. While the systems execute multiple programs concurrently in interleaved fashion for high performance, they guarantee that the execution's result leaves the database in the same state as some serial execution of the same transactions. The term ACID denotes that a transaction is atomic in that the system executes it completely or not at all; consistent in that it leaves the database in a consistent state; isolated in that the effects of incomplete execution are not exposed; and durable in that results from completed transactions survive failures. The transaction abstraction is one of the great achievements of database management systems, freeing programmers from concern about other concurrently executing programs or failures: they simply must ensure that their program keeps the database consistent when run by itself to completion. The database system usually implements this abstraction by obtaining locks when a transaction reads or writes a shared object, typically according to a two-phase locking regimen that ensures the resulting executions are equivalent to some serial execution of all transactions. The system first durably records all changes to a write-ahead log, which allows it to undo incomplete transactions if need be and to restore completed transactions after failures. In a distributed database, if a transaction modifies objects stored at multiple servers, it must obtain and hold locks across those servers. While this is costly even if the servers are collocated, it is more costly if the servers are in different datacenters. When data is replicated, everything becomes even more complex because it is necessary to ensure that the surviving nodes in a failure scenario



can determine the actions of both completed transactions (which must be restored) and incomplete transactions (which must be undone). Typically, the system can achieve this by using a majority protocol (in which writes are applied to most of the copies, or quorum, and a quorum member serves the reads). In addition to the added costs incurred during normal execution, these measures can force a block during failures that involve network partitions, compromising availability, as the CAP theorem describes.3,4 Both the database and distributed systems literature offer many alternative proposals for the semantics of concurrent operations. Although the database notions of consistency apply to a distributed setting (even though they can be more expensive to enforce and might introduce availability tradeoffs), they were originally designed to


allow interleaving of programs against a centralized database. Thus, the goal was to provide a simple programming abstraction to cope with concurrent executions, rather than to address the challenges of a distributed setting. These differences in setting have influenced how both communities have approached the problem, but the following two differences in perspective are worth emphasizing:

• Unit of consistency. The database perspective, as exemplified by the notion of ACID transactions, focuses on changes to the entire database, spanning multiple objects (typically, records in a relational database). The distributed systems literature generally focuses on changes to a single object.5
• Client- versus data-centric semantics. The database community's approach to defining semantics is usually through formalizing the effect of concurrent accesses on the database; again, the definition of ACID transactions exemplifies this approach—the effect of interleaved execution on the database must be equivalent to that of some serial execution of the same transactions. But the distributed systems community often takes a client-centric approach, defining consistency levels in terms of what a client that issues reads and writes sees (potentially) against a distributed data store in the presence of other concurrently executing clients.

The notions of consistency proposed in the distributed systems literature focus on a single object and are client-centric definitions. Strong consistency means that once a write request returns successfully to the client, all subsequent reads of the object—by any client—see the effect of the write, regardless of replication, failures, partitions, and so on. Observe that strong consistency does not ensure ACID transactions. For example, client A could read object X once, and then read it again later and see the effects of another client's intervening write, even though this is not equivalent to a serial execution of the two clients' programs. That said, implementing ACID transactions ensures strong consistency. The term weak consistency describes any alternative that does not guarantee strong consistency for changes to individual objects. A notable instance of weak consistency is eventual consistency, which is supported by Amazon's Dynamo system,6 among others.1,5 Intuitively, if an object has multiple copies at different servers, updates are first applied to the local copy and then propagated out; the guarantee offered is that every update is eventually applied to all copies. However, there is no assurance of the order in which the system will apply the updates—in fact, it might apply the updates in different orders on different copies. Unless the nature of the updates makes the ordering immaterial—for example, commutative and associative updates—two copies of the same object could differ in ways that are hard for a programmer to identify. Researchers have proposed several versions of weak consistency,5 including:

• read-your-writes—a client always sees the effect of its own writes;
• monotonic read—a client that has read a particular value of an object will not see previous values on subsequent accesses; and
• monotonic write—all writes a client issues are applied serially in the issued order.

Each of these versions can help strengthen eventual consistency in terms of the guarantees offered to a client.
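As an illustration of how such client-centric guarantees can be layered on an eventually consistent store, here is a small hypothetical Java sketch of a client session that enforces monotonic reads and read-your-writes by tracking versions; the VersionedStore interface and its version scheme are assumptions, not the API of any system discussed here.

import java.util.HashMap;
import java.util.Map;

// Hypothetical value+version pair returned by the store.
final class Versioned {
    final String value;
    final long version;          // monotonically increasing per object
    Versioned(String value, long version) { this.value = value; this.version = version; }
}

// Hypothetical store: any replica may answer a read, possibly with a stale version.
interface VersionedStore {
    Versioned read(String key);
    long write(String key, String value);   // returns the version assigned to the write
}

// Client-side session that strengthens eventual consistency to monotonic reads and
// read-your-writes by remembering the highest version it has observed per key.
class Session {
    private final VersionedStore store;
    private final Map<String, Long> highestSeen = new HashMap<>();

    Session(VersionedStore store) { this.store = store; }

    String read(String key) {
        while (true) {
            Versioned r = store.read(key);
            long floor = highestSeen.getOrDefault(key, -1L);
            if (r.version >= floor) {                 // never observe an older state than before
                highestSeen.put(key, r.version);
                return r.value;
            }
            // A stale replica answered; retry, ideally against a different replica.
        }
    }

    void write(String key, String value) {
        long v = store.write(key, value);
        highestSeen.merge(key, v, Long::max);         // subsequent reads must reflect this write
    }
}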

CLOUD DATA MANAGEMENT

Web applications, a major motivator for the development of cloud systems, have grown rapidly in popularity and must be able to scale on demand. Systems must serve requests with low latency (tens of milliseconds) to users worldwide, throughput is high (tens of thousands of reads and writes per second), and applications must be highly available, all at minimal ongoing operational costs. Fortunately, full transactional support typically is not required, and separate systems perform complex analysis tasks—for example, map-reduce platforms such as Hadoop (http://hadoop.apache.org). For many applications, requests are quite simple compared to traditional data management settings—the data


might be user session data, with all user actions on a webpage written to and read from a single record, or it might be social, with social activities written to a single user record, and a user’s friends’ activities read from a small number of other user records. These challenges have led to the development of a new generation of analytic and serving systems based on massively distributed architectures that involve clusters of thousands of machines. All data is routinely replicated within a datacenter for fault tolerance; sometimes the data is even georeplicated across multiple datacenters for low-latency reads. Massively distributed architectures lend themselves to adding capacity incrementally and on demand, which in turn opens the door to building multitenanted, hosted systems with several applications sharing underlying resources. These cloud systems need not be massively distributed, but many current offerings are, such as those from Amazon,6 Google,7,8 Microsoft,9 Yahoo,1 and the Cassandra (http://cassandra.apache.org) and HBase (http://hbase.apache.org) open source systems. Although Web data management provided the original motivation for massively distributed cloud architectures, these systems are also making rapid inroads in enterprise data management. Furthermore, the rapid growth in mobile devices with considerable storage and computing power is leading to systems in which the number of nodes is on the order of hundreds of millions, and disconnectivity is no longer a rare event. This new class of massively distributed systems is likely to push the limits of how current cloud systems handle the challenges highlighted by the CAP theorem. The belief that applications do not need the greater functionality of traditional database systems is a fallacy. Even for Web applications, greater functionality simplifies the application developer’s task, and better support for data consistency is valuable: depending on the application, eventual consistency is often inadequate, and sometimes nothing less than ACID will suffice. As the ideas underlying cloud data-serving systems find their way into enterprise-oriented data management systems, the fraction of applications that benefit from (indeed, require) higher levels of consistency and functionality will rise sharply. Although it is likely that some fundamental tradeoffs will remain, we are witnessing an ongoing evolution from the first generation of cloud data-serving systems to increasingly more complete systems. Bigtable and HBase are systems that write synchronously to all replicas, ensuring they are all always up to date. Dynamo and Cassandra enforce that writes must succeed on a quorum of servers before they return success to the client. They maintain record availability during network partitions, but at the cost of consistency because they do not insist on reading from the write quorum. Megastore comes closer to the consistency of a traditional

DBMS, supporting ACID transactions (meant to be used on records within the same group); it uses Paxos for synchronous replication across regions. (Note that because new systems are being announced in this space at a rapid pace, this is not meant to be a comprehensive survey.)
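As a quick illustration of the quorum arithmetic behind that tradeoff (standard quorum reasoning, not code from Dynamo or Cassandra): a read is guaranteed to observe the latest successful write only when the read and write quorums are large enough to overlap.

// Quorum overlap check: with N replicas, a write quorum of W and a read quorum
// of R intersect in at least one replica iff R + W > N.
public class Quorum {
    public static boolean readSeesLatestWrite(int n, int w, int r) {
        return r + w > n;
    }
    public static void main(String[] args) {
        System.out.println(readSeesLatestWrite(3, 2, 1)); // false: a read can miss the latest write
        System.out.println(readSeesLatestWrite(3, 2, 2)); // true: quorums must overlap
    }
}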

PNUTS: A CASE STUDY

Yahoo has 680 million customers and numerous internal platforms with stringent latency requirements (fewer than 10 ms is common). Servers can fail, and individual datacenters can suffer network partitions or general shutdown due to disaster, but data must remain available under any failure conditions, which is achieved via replication at datacenters. We developed PNUTS to support CRUD—create, retrieve, update, delete—workloads in this setting.


Many applications have moved to PNUTS either from pure LAMP (Linux, Apache, MySQL, PHP) stacks or from other legacy key-value stores. Illustrative applications include Yahoo's user location, user-generated content, and social directory platforms; the Yahoo Mail address book; Yahoo Answers, Movies, Travel, Weather, and Maps applications; and user profiles for ad and content personalization. The reasons for adopting PNUTS include flexible records and schema evolution; the ability to efficiently retrieve small ranges of records in order (for example, comments by time per commented-upon article); notifications of changes to a table; hosted multidatacenter storage; and above all, reliable, global, low-latency access. At Yahoo, the experience with PNUTS led to several findings:

• The cloud model of hosted, on-demand storage with low-latency access and highly available multidatacenter replication has proved to be very popular.
• For many applications, users are willing to compromise on features such as complex queries and ACID transactions.
• Additional features greatly increase the range of applications that are easily developed by using the PNUTS system for data management, and, not surprisingly, the adoption for these applications. In particular, providing support for ordered tables—which allows arranging tables according to a composite key and enables efficient range scans—sparked a big increase in adoption, and we expect support for selective replication and secondary indexes to have a similar effect.
• Users have pushed for more options in the level of consistency.

In the context of this article, the most relevant features are those that involve consistency.

Relaxed consistency

PNUTS was one of the earliest systems to natively support geographic replication, using asynchronous replication to avoid long write latencies. Systems that make copies within the same datacenter have the option of synchronous replication, which ensures strong consistency. This is not viable for cross-datacenter replication, which requires various forms of weak consistency.


However, eventual consistency does not always suffice for supporting the semantics natural to an application. For example, suppose we want to maintain the state of a user logged in to Yahoo who wants to chat. Copies of this state might be maintained in multiple georegions and must be updated when the user decides to chat or goes offline. Consider what happens when the two regions are disconnected because of a link failure: when the link is restored, it is not sufficient for both copies to eventually converge to the same state; rather, the copies must converge to the state most recently declared by the user in the region where the user was most recently active. It is possible to update the copies via ACID transactions, but supporting ACID transactions in such a setting is a daunting challenge. Fortunately, most Web applications tend to write a single record at a time—such as changing a user’s chat status or home location in the profile record— and it is acceptable if subsequent reads of the record (for example, by a friend or user) do not immediately see the write. This observation is at the heart of the solution we adopted in PNUTS, called timeline consistency.

Timeline consistency

In timeline consistency, an object and its replicas need not be synchronously maintained, but all copies must follow the same state timeline (possibly with some replicas skipping forward across some states). PNUTS does not allow objects to go backward in time or to not appear in the timeline associated with the object. The approach essentially is primary copy replication.1 At any given time, every object has exactly one master copy (in PNUTS, each record in a table is an object in this sense), and updates are applied at this master and then propagated to other copies, thereby ensuring a unique ordering of all updates to a record. Protocols for automatically recognizing master failures and transferring mastership to a surviving copy ensure high availability and support automated load-balancing strategies that transfer this mastership to the location where a record is most often updated.

To understand the motivation behind this design, consider latency. Data must be globally replicated, and synchronous replication increases latency unacceptably. Database systems support auxiliary data structures such as secondary indexes and materialized views, but maintaining such structures synchronously in a massively distributed environment further increases latency. The requirement of low-latency access therefore leads to asynchronous replication, which inherently compromises consistency.10 However, because object timelines provide a foundation that can support many read variants, applications that can tolerate some staleness can trade off consistency for performance, whereas those that require consistent data can rely on object timelines. Timeline consistency compromises availability, but only in those rare cases where the master copy fails and there is a partition or a failure in the messaging system that causes the automated protocol for transferring mastership to block. Additionally, timeline consistency weakens the notion of consistency because clients can choose to read older versions of objects even during normal operation. Again, this reflects a fundamental concern for minimizing latency, and it is in the spirit of Daniel Abadi's observation that the CAP theorem overlooks an important aspect of large-scale distributed systems—namely, latency (L). According to Abadi's proposed reformulation,11 CAP should really be PACELC: if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?
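The following minimal Java sketch illustrates the primary-copy idea behind timeline consistency under assumed names (it is not PNUTS code): the master of a record stamps each update with a strictly increasing sequence number, and replicas apply updates only in that order, so every copy moves along the same timeline even if it lags or skips forward.

// Minimal sketch of per-record timeline consistency (primary-copy style).
public class TimelineReplication {

    // Master side: one master per record decides the order of its updates.
    static class RecordMaster {
        private long nextSeq = 1;
        private String value;

        synchronized Update apply(String newValue) {
            value = newValue;
            return new Update(nextSeq++, newValue);   // unique position on the record's timeline
        }
        synchronized String readUpToDate() { return value; }
    }

    // An update stamped with its position on the timeline.
    static class Update {
        final long seq; final String value;
        Update(long seq, String value) { this.seq = seq; this.value = value; }
    }

    // Replica side: may lag or skip forward, but never goes backward on the timeline.
    static class Replica {
        private long appliedSeq = 0;
        private String value;

        synchronized void onReplicate(Update u) {
            if (u.seq > appliedSeq) {                 // ignore duplicate or out-of-date deliveries
                appliedSeq = u.seq;
                value = u.value;
            }
        }
        synchronized String readAny() { return value; } // may be stale, but on the same timeline
    }
}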

Selective record replication

Many Yahoo applications have a truly global user base and replicate to many more regions than needed for fault tolerance. But while an application might be global, its records could actually be local. A PNUTS record that contains a user's profile is likely only ever written and read in one or a few geographic regions where that user and his or her friends live. Legal issues that limit where records can be replicated also arise at Yahoo; this pattern typically follows user locality as well. To address this concern, we added per-record selective replication to PNUTS.12 Regions that do not have a full copy of a record still have a stub version with enough metadata to know which regions contain full copies, so they can forward requests. A stub is only updated at record creation or deletion, or when the record's replica location changes. Normal data updates are sent only to regions containing full copies of the record, saving bandwidth and disk space.

THE CASE FOR A CONSISTENCY SPECTRUM

Cloud data management systems designed for real-time data serving and update workloads amply illustrate the realities of the CAP theorem: such systems cannot support strong consistency with availability in the presence of partitions. Indeed, such massively distributed systems might settle for weaker consistency guarantees to improve latency, especially when data is georeplicated. In practice, a programmer using such a system must be able to explicitly make tradeoffs among consistency, latency, and availability in the face of various failures, including partitions. Fortunately, several consistency models allow for such tradeoffs, suggesting that programmers should be allowed to mix and match them to meet an application's needs. We organize the discussion to highlight two independent dimensions: the unit of data that is considered in defining consistency and the spectrum of strong to weak consistency guarantees for a given choice of unit.

Unit of consistency

While the database literature commonly defines consistency in terms of changes to the entire database, the distributed systems literature typically considers changes to each object, independent of changes to other objects. These are not the only alternatives; intuitively, any collection of objects to which we can ensure atomic access and that the system replicates as a unit can be made the unit of consistency. For example, any collection of objects collocated on a single server can be a reasonable choice as the unit of consistency (from the standpoint of ensuring good performance), even in the presence of failures. One widely recognized case in which multirecord transactions are useful is an entity group, which comprises an entity and all its associated records. As an example, consider a user (the "entity") together with all user-posted comments and photos, and user counters, such as the number of comments. It is frequently useful to update the records in an entity group together, for example, by inserting a comment and updating the number-of-comments counter. Usually, an entity group's size is modest, and a single server can accommodate one copy of the entire set of records. Google's App Engine provides a way to define entity groups and operate on them transactionally; Microsoft's Azure has a similar feature that allows record grouping via a partition key as well as transactional updates to records in a partition.

The basic approach to implementing transactions over entity groups is straightforward and relies on controlling how records are partitioned across nodes to ensure that all records in an entity group reside on a single node. Then, the system can invoke conventional database transaction managers without using cross-server locks or other expensive mechanisms. This model has two basic restrictions: first, the entity group must be small enough to fit on a single node; indeed, for effective load balancing, the size must allow many groups to fit on a single node. Second, the definition of an entity group is static and typically specifies a composite key over the record's attributes. A recent proposal considers how to relax the second restriction and allow defining entity groups more generally and dynamically.13
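A minimal Java sketch of that partitioning idea, with assumed names rather than the App Engine or Azure APIs: because all records of a group share the group key, hashing only the group key places the whole group on one node, where an ordinary local transaction can update several of its records atomically.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch: partition by entity-group key so that a whole group lives on one node,
// allowing ordinary single-node transactions over the group. Names are illustrative.
public class EntityGroupPartitioning {
    static final int NUM_NODES = 16;

    // All records of a group share the same groupKey (e.g., the user id),
    // so hashing only the groupKey sends the entire group to one node.
    static int nodeFor(String groupKey) {
        return Math.floorMod(groupKey.hashCode(), NUM_NODES);
    }

    // A node holds its groups' records and runs group-local transactions.
    static class Node {
        private final Map<String, Map<String, String>> groups = new HashMap<>();

        synchronized void transact(String groupKey, Consumer<Map<String, String>> txn) {
            Map<String, String> group = groups.computeIfAbsent(groupKey, k -> new HashMap<>());
            txn.accept(group);     // atomic with respect to other transactions on this node
        }
    }

    public static void main(String[] args) {
        Node[] nodes = new Node[NUM_NODES];
        for (int i = 0; i < NUM_NODES; i++) nodes[i] = new Node();

        String user = "user:42";                       // the entity-group key
        nodes[nodeFor(user)].transact(user, g -> {
            g.put("comment:1", "nice post");           // insert a comment ...
            g.merge("numComments", "1", (a, b) -> String.valueOf(Integer.parseInt(a) + 1));
        });                                            // ... and bump the counter atomically
    }
}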


A consistency spectrum

We begin by discussing a spectrum of consistency across copies of a single object and then discuss how to generalize these ideas to handle other units of consistency.

Consistency models for individual objects. Timeline consistency offers a simple programming model: copies of a record might lag the master copy, but the system applies all updates to every copy in the same order as the master. Note that this is a data-centric guarantee. From the client's perspective, monotonic writes are guaranteed, so an object timeline—a timestamp generated at the master copy that identifies each state and its position on the object's timeline—can support several variants of the read operation, each with different guarantees:

• Read-any. Any copy of the object can be returned, so if a client issues this call twice, the second call might actually see an older version of the object, even if the master copy is available and timeline consistency is enforced. Intuitively, the client reads a local copy that later becomes unavailable, and the second read is served from another (nonmaster) copy that is more stale.
• Critical-read. Also known as monotonic read, critical-read ensures that the copy read is fresher than any previous version the client has seen. By remembering the last client-issued write, the critical-read operation can be extended to support read-your-writes, although to make this more efficient, it might be necessary to additionally cache a client's writes locally.



• Read-up-to-date. To get the current version of the object, read-up-to-date accesses the master copy.
• Test-and-set. Widely used in PNUTS, test-and-set is a conditional write that is applied only if the version at the master copy at the time the write is applied is unchanged from the version previously read by the client issuing the write. It is sufficient to implement single-object ACID transactions.
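As an illustration of how a client might combine these variants, here is a hypothetical Java sketch of a read-modify-write loop built on read-up-to-date and test-and-set; the interface and type names are assumptions, not the PNUTS API.

// A value together with its timeline position (version), as assumed in this sketch.
final class VersionedValue {
    final String value; final long version;
    VersionedValue(String value, long version) { this.value = value; this.version = version; }
}

// Hypothetical per-record interface exposing the read variants and a conditional write.
interface TimelineRecordStore {
    VersionedValue readAny(String key);        // any replica; may be stale
    VersionedValue readUpToDate(String key);   // served by the master copy
    // Applies the write only if the master's current version still equals expectedVersion.
    boolean testAndSet(String key, long expectedVersion, String newValue);
}

class Counters {
    // Single-object read-modify-write with ACID semantics for that object:
    // retry until no concurrent writer slipped in between our read and our write.
    static void increment(TimelineRecordStore store, String key) {
        while (true) {
            VersionedValue current = store.readUpToDate(key);
            String next = String.valueOf(Integer.parseInt(current.value) + 1);
            if (store.testAndSet(key, current.version, next)) {
                return;                        // our write took its place on the master's timeline
            }
            // else: another client's write intervened; re-read and try again
        }
    }
}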

Timeline consistency over entity groups. A natural generalization of timeline consistency and entity groups is to consider entity group timelines rather than individual record timelines. Each entity group, rather than each record, has a master copy, and transactional updates to an entity group (possibly affecting multiple records) are applied at that master copy, just as individual record updates are in timeline consistency. The transaction sequence is then logged, asynchronously shipped to the sites with copies of the entity group, and reapplied at each such site.


Although this generalization should be supportable with performance and availability characteristics comparable to timeline consistency and entity group consistency, I am not aware of any systems that (yet) do so. It seems an attractive option on the consistency spectrum, and covers many common applications that would otherwise require full ACID transactions.

Offering consistency choices

Geographic replication makes all records always available for read from anywhere. However, anytime a distributed system is partitioned due to failures, it is impossible to preserve both write consistency and availability. One alternative is to support multiple consistency models and let the application programmer decide whether and how to degrade in case of failure. Eric Brewer suggests14 thinking in terms of a partition mode—how a client enters and exits this mode, and what it does while in partition mode and upon exit. Intuitively, a client enters the partition mode (due to a failure of some kind, triggered by a mechanism such as a time-out) when it cannot complete a read or write operation with the desired level of consistency. The client must then operate with the recognition that it is seeing a version of the database that is not strongly consistent, and when emerging from partition mode (when the system resolves the underlying failure and signals this state in some way), the client must reconcile inconsistencies between the objects it has accessed.

In the PNUTS implementation of timeline consistency, a client enters the partition mode when it attempts to write an object but the write is blocked because the master copy is unreachable and the mastership transfer protocol also is blocked, typically because of a partition or site failure. At this point, the client should be able to choose to degrade from timeline consistency to eventual consistency and continue by writing another copy. However, the system is now in a mode in which the given object does not have a unique master. At some future point, the client must explicitly reconcile different versions of the object—perhaps using system-provided version vectors for the object—if the weaker guarantees of eventual consistency do not suffice for this object. PNUTS is not this flexible—it requires the programmer to choose between timeline consistency and eventual consistency at the level of a table of records. It treats each record in a table as an object with the associated consistency semantics. If eventual consistency is selected, inserts and updates can be made to any region at any time. For programmers who understand and can accept the eventual consistency model, the performance benefits are great: the system can perform all writes locally at the client, greatly improving write latencies. Given a server node failure, another (remote) node will always be available to accept writes. PNUTS can also enter partition mode if a read request requires access to the master copy; again, the client has the choice of waiting for the system to restore access or proceeding by reading an available copy. Furthermore, a client can choose to read any available copy if a slight lag from the master copy is acceptable even during normal operation.

Although massively distributed systems provide multiple abstractions to cope with consistency, programmers need to be able to mix and match these abstractions. The discussion of per-object timeline consistency highlights how data- and client-centric approaches to defining consistency are complementary. The discussion of how to build on per-object timeline consistency to support different client-side consistency guarantees carries over to timeline consistency over entity groups. Indeed, this is a useful way to approach consistency in distributed systems in general:

• decide on the units of consistency that the system is to support;
• decide on the consistency guarantees the system supports from a data-centric perspective;
• decide on the consistency guarantees the system supports from a client-centric perspective; and
• expose these available choices to programmers through variations of create-read-write operations, so they can make the tradeoffs among availability, consistency, and latency as appropriate for their application.


For example, we could allow:

• the type of consistency desired—timeline or eventual—to be a property of a collection of objects;
• alternative forms of reading an object that support various client-centric consistency semantics;
• flexible definition of groups of objects to be treated as one object for consistency purposes; and
• specification of how to degrade gracefully to weaker forms of consistency upon failure.

Determining the right abstractions and the right granularities for expressing these choices requires further research and evaluation. It is time to think about programming abstractions that allow specifying massively distributed transactions simply and in a manner that reflects what the system can implement efficiently, given the underlying realities of latency in wide-area networks and the CAP theorem.

The tradeoff between consistency on one hand and availability/performance on the other has become a key factor in the design of large-scale data management systems. Although Web-oriented systems led the break from traditional relational database systems, the ideas have begun to enter the database mainstream, and over the next several years, cloud data management for enterprises will offer database administrators some of the same design choices. The key observation is that we have choices—not just between ACID transactions and full RDBMS capabilities on one side and NoSQL systems offering no consistency guarantees and minimal query and update capabilities on the other. We will see systems that are somewhere in the middle of this spectrum, striving to provide as much functionality as possible while satisfying the availability and performance demands of diverse application settings. Designing abstractions that cleanly package these choices, developing architectures that robustly support them, and optimizing and autotuning these systems will in turn provide research challenges for the next decade.

Acknowledgments

In writing this article, I was strongly influenced by the experience gained in designing, implementing, and deploying the PNUTS system (also known as Sherpa within Yahoo), and I thank the many people who contributed to it and to the many papers we wrote jointly. PNUTS is a collaboration between the systems research group and the cloud platform group at Yahoo. I also thank the anonymous referees and Daniel Abadi for useful feedback that improved the article, and Eric Brewer for sharing a preprint of his article that appears elsewhere in this issue.

References
1. B.F. Cooper et al., "PNUTS: Yahoo!'s Hosted Data Serving Platform," Proc. VLDB Endowment (VLDB 08), ACM, 2008, pp. 1277-1288.
2. A. Silberstein et al., "PNUTS in Flight: Web-Scale Data Serving at Yahoo," IEEE Internet Computing, vol. 16, no. 1, 2012, pp. 13-23.
3. E. Brewer, "Towards Robust Distributed Systems," Proc. 19th Ann. ACM Symp. Principles of Distributed Computing (PODC 00), ACM, 2000, pp. 7-10.
4. S. Gilbert and N. Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services," ACM SIGACT News, June 2002, pp. 51-59.
5. W. Vogels, "Eventually Consistent," ACM Queue, vol. 6, no. 6, 2008, pp. 14-19.
6. G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store," Proc. 21st ACM SIGOPS Symp. Operating Systems Principles (SOSP 07), ACM, 2007, pp. 205-220.
7. F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data," ACM Trans. Computer Systems, June 2008, article no. 4; doi:10.1145/1365815.1365816.
8. J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," Proc. Conf. Innovative Database Research (CIDR 11), ACM, 2011, pp. 223-234.
9. P.A. Bernstein et al., "Adapting Microsoft SQL Server for Cloud Computing," Proc. IEEE 27th Int'l Conf. Data Eng. (ICDE 11), IEEE, 2011, pp. 1255-1263.
10. P. Agrawal et al., "Asynchronous View Maintenance for VLSD Databases," Proc. 35th SIGMOD Int'l Conf. Management of Data, ACM, 2009, pp. 179-192.
11. D.J. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design," Computer, Feb. 2012, pp. 37-42.
12. S. Kadambi et al., "Where in the World Is My Data?" Proc. VLDB Endowment (VLDB 2011), ACM, 2011, pp. 1040-1050.
13. S. Das, D. Agrawal, and A.E. Abbadi, "G-Store: A Scalable Data Store for Transactional Multi Key Access in the Cloud," Proc. ACM Symp. Cloud Computing (SoCC 10), ACM, 2010, pp. 163-174.
14. E. Brewer, "Pushing the CAP: Strategies for Consistency and Availability," Computer, Feb. 2012, pp. 23-29.

Raghu Ramakrishnan heads the Web Information Management Research group at Yahoo and also serves as chief scientist for cloud computing and search. Ramakrishnan is an ACM and IEEE Fellow and has received the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award, a Packard Foundation Fellowship, and the Distinguished Alumnus Award from IIT Madras. Contact him at scyllawi@yahoo.com.




Distributed Systems