
Managing Editor: Ian J. Leary, Mathematical Sciences, University of Southampton, UK

63 Singular points of plane curves, C. T. C. WALL
64 A short course on Banach space theory, N. L. CAROTHERS
65 Elements of the representation theory of associative algebras I, IBRAHIM ASSEM, DANIEL SIMSON & ANDRZEJ SKOWROŃSKI
66 An introduction to sieve methods and their applications, ALINA CARMEN COJOCARU & M. RAM MURTY
67 Elliptic functions, J. V. ARMITAGE & W. F. EBERLEIN
68 Hyperbolic geometry from a local viewpoint, LINDA KEEN & NIKOLA LAKIC
69 Lectures on Kähler geometry, ANDREI MOROIANU
70 Dependence logic, JOUKO VÄÄNÄNEN
71 Elements of the representation theory of associative algebras II, DANIEL SIMSON & ANDRZEJ SKOWROŃSKI
72 Elements of the representation theory of associative algebras III, DANIEL SIMSON & ANDRZEJ SKOWROŃSKI
73 Groups, graphs and trees, JOHN MEIER
74 Representation theorems in Hardy spaces, JAVAD MASHREGHI
75 An introduction to the theory of graph spectra, DRAGOŠ CVETKOVIĆ, PETER ROWLINSON & SLOBODAN SIMIĆ
76 Number theory in the spirit of Liouville, KENNETH S. WILLIAMS
77 Lectures on profinite topics in group theory, BENJAMIN KLOPSCH, NIKOLAY NIKOLOV & CHRISTOPHER VOLL
78 Clifford algebras: An introduction, D. J. H. GARLING
79 Introduction to compact Riemann surfaces and dessins d'enfants, ERNESTO GIRONDO & GABINO GONZÁLEZ-DIEZ
80 The Riemann hypothesis for function fields, MACHIEL VAN FRANKENHUIJSEN
81 Number theory, Fourier analysis and geometric discrepancy, GIANCARLO TRAVAGLINI
82 Finite geometry and combinatorial applications, SIMEON BALL
83 The geometry of celestial mechanics, HANSJÖRG GEIGES
84 Random graphs, geometry and asymptotic structure, MICHAEL KRIVELEVICH et al.
85 Fourier analysis: Part I – Theory, ADRIAN CONSTANTIN
86 Dispersive partial differential equations, M. BURAK ERDOĞAN & NIKOLAOS TZIRAKIS
87 Riemann surfaces and algebraic curves, R. CAVALIERI & E. MILES
88 Groups, languages and automata, DEREK F. HOLT, SARAH REES & CLAAS E. RÖVER
89 Analysis on Polish spaces and an introduction to optimal transportation, D. J. H. GARLING
90 The homotopy theory of (∞, 1)-categories, JULIA E. BERGNER
91 The block theory of finite group algebras I, MARKUS LINCKELMANN
92 The block theory of finite group algebras II, MARKUS LINCKELMANN
93 Semigroups of linear operators, DAVID APPLEBAUM
94 Introduction to approximate groups, MATTHEW C. H. TOINTON
95 Representations of finite groups of Lie type (2nd Edition), FRANÇOIS DIGNE & JEAN MICHEL
96 Tensor products of C*-algebras and operator spaces, GILLES PISIER
97 Topics in cyclic theory, DANIEL G. QUILLEN & GORDON BLOWER
98 Fast track to forcing, MIRNA DŽAMONJA
99 A gentle introduction to homological mirror symmetry, RAF BOCKLANDT
100 The calculus of braids, PATRICK DEHORNOY
101 Classical and discrete functional analysis with measure theory, MARTIN BUNTINAS
102 Notes on Hamiltonian dynamical systems, ANTONIO GIORGILLI

A Course in Stochastic Game Theory

EILON SOLAN

Tel-Aviv University

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org

Information on this title: www.cambridge.org/9781316516331
DOI: 10.1017/9781009029704

© Eilon Solan 2022

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2022

A catalogue record for this publication is available from the British Library.

Library of Congress Cataloging-in-Publication Data

Names: Solan, Eilon, author.
Title: A course in stochastic game theory / Eilon Solan, School of Mathematical Sciences.
Description: First edition. | New York: Cambridge University Press, 2022. |
Subjects: LCSH: Game theory. | BISAC: MATHEMATICS / General
Classification: LCC QA269 .S65 2022 (print) | LCC QA269 (ebook) | DDC 519.3–dc23/eng20220207
LC record available at https://lccn.loc.gov/2021055382
LC ebook record available at https://lccn.loc.gov/2021055383

ISBN 978-1-316-51633-1 Hardback
ISBN 978-1-009-01479-3 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

The book is dedicated to the people who shared my life, my work, and my love of games in the last thirty years. To Abraham Neyman, who introduced me to game theory and to stochastic games; to Nicolas Vieille and Dinah Rosenberg, with whom I spent countless fun weeks of studying stochastic games; to Ehud Lehrer, who has been my colleague and partner at Tel Aviv University for the past 20 years; to my parents, Chaim and Zafrira; and to my two sons, Omri and Ron, who listened to game-theoretic problems since birth and eventually became coauthors.

5 Two-Player Zero-Sum Discounted Games

6 Semi-Algebraic Sets and the Limit of the Discounted Value

6.1 Semi-Algebraic Sets

7.5 Continuity of the Value

Introduction

Stochastic games are a mathematical model that is used to study dynamic interactions among agents who influence the evolution of the environment. These games were first presented and studied by Lloyd Shapley (1953).¹,² Since Shapley's seminal work, the literature on stochastic games expanded considerably, and the model was applied to numerous areas, such as arms race, fishery wars, and taxation.

A stochastic game is played in discrete time by a finite set $I$ of players, and it consists of a finite number of states. In each state $s$, each player $i \in I$ has a given set of actions, denoted $A^i(s)$. In every stage $t \in \mathbb{N}$, the play is in one of the states, denoted $s_t$. Each player $i \in I$ chooses an action $a^i_t \in A^i(s_t)$ that is available to her at the current stage, receives a stage payoff, which depends on the current state $s_t$ as well as on the actions $(a^j_t)_{j \in I}$ chosen by the players, and a new state $s_{t+1}$ is chosen, according to a probability distribution that depends on the current state and on the actions of the players $(a^j_t)_{j \in I}$.

In a stochastic game, the players have two, seemingly contradicting, goals. First, they need to ensure that their future opportunities remain high. At the same time, they should make sure that their stage payoff is also high. This dichotomy makes the analysis of stochastic games intriguing and not trivial.

The study of stochastic games uses tools from many mathematical branches, such as probability, analysis, algebra, differential equations, and combinatorics. The goal of this book is to present the theory through the mathematical techniques that it employs. Thus, each chapter presents mathematical results from some branch of mathematics, and uses them to prove results on stochastic games. The goal is not to prove the most general theorems in stochastic games, but rather to present the beauty of the theory. Accordingly, we sometimes restrict the scope of the results that are proven, to allow for simpler proofs that bypass technical difficulties.

¹ Lloyd Stowell Shapley (Cambridge, Massachusetts, June 2, 1923 – Tucson, Arizona, March 12, 2016) was an American mathematician who made many influential contributions to Game Theory, like the Shapley value, stochastic games, and the deferred-acceptance algorithm for stable marriages. Shapley shared the 2012 Nobel Prize in Economics together with game theorist Alvin Roth.

² All commentary is taken from Wikipedia.

The material in this book is summarized by the following table:

Chapter | Tool | Result
1 | Contracting mappings | Stationary optimal strategies in Markov decision problems
2 | Tauberian Theorem | Uniform ε-optimality in hidden Markov decision problems
5 | Contracting mappings | Stationary discounted optimal strategies in zero-sum stochastic games
6 | Semi-algebraic mappings | Existence of the limit of the discounted value
7 | B-graphs | Continuity of the limit of the discounted value
8 | Kakutani's fixed point theorem | Stationary discounted equilibria in multiplayer stochastic games
9 | | Existence of the uniform value in zero-sum stochastic games
10 | The vanishing discount factor approach | Existence of uniform equilibrium in absorbing games
11 | Ramsey's Theorem | Existence of undiscounted equilibrium in two-player deterministic stopping games
12 | Approximating infinite orbits | Existence of undiscounted equilibrium in multiplayer quitting games
13 | Linear complementarity problems | Existence of undiscounted equilibrium in multiplayer quitting games

Each chapter contains exercises. Solutions are available as supplementary material on the book's page on the publisher's website. The book is based on a graduate level course that I taught at Tel Aviv University for more than a decade. I hope that the readers, as my students, will like the diversity of the topics and the elegance of the proofs. For the benefit of readers who would like to expand their knowledge in stochastic games, I added references to related results at the end of each chapter. Books and surveys that include material on different aspects of stochastic games include Raghavan et al. (1991), Raghavan and Filar (1991), Filar and Vrieze (1997), Başar and Olsder (1998), Mertens (2002), Vieille (2002), Neyman and Sorin (2003), Solan (2008), Chatterjee et al. (2009, 2013), Chatterjee and Henzinger (2012), Laraki and Sorin (2015), Mertens et al. (2015), Solan and Vieille (2015), Solan and Ziliotto (2016), Başar and Zaccour (2017), Jaśkiewicz and Nowak (2018a,b), and Renault (2019).

I end the introduction by thanking Ayala Mashiah-Yaakovi, who read the manuscript and the solution manual and made many comments that improved the text; Andrei Iacob, who copyedited the text; and John Yehuda Levy, Andrzej Nowak, Robert Simon, Bernhard von Stengel, Uri Zwick, and my students throughout the years for providing comments and spotting typos.

Notation

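The set of positive integers is
\[ \mathbb{N} := \{1, 2, 3, \ldots\}. \]

The number of elements in a finite set $K$ is denoted by $|K|$. For every finite set $K$, the set of probability distributions over $K$ is denoted by $\Delta(K)$. We identify each element $k \in K$ with the probability distribution in $\Delta(K)$ that assigns probability 1 to $k$. For a probability distribution $\mu \in \Delta(K)$, the support of $\mu$, denoted $\mathrm{supp}(\mu)$, is the set of all elements $k \in K$ that have positive probability under $\mu$:
\[ \mathrm{supp}(\mu) := \{k \in K : \mu[k] > 0\}. \]

A probability distribution is pure if $\mathrm{supp}(\mu)$ contains only one element: $|\mathrm{supp}(\mu)| = 1$.

Let $I$ be a finite set, and, for each $i \in I$, let $A^i$ be a set. We denote by $A^I := \prod_{i \in I} A^i$ the cartesian product, and denote $A^{-i} := \prod_{j \in I \setminus \{i\}} A^j$. Similarly, if $a = (a^i)_{i \in I} \in A^I$, we denote by $a^{-i} := (a^j)_{j \in I \setminus \{i\}} \in A^{-i}$ the vector $a$ with its $i$'th coordinate removed.

We will use two norms, the $L_1$-norm and the $L_\infty$-norm (or the maximum norm). For a vector $x \in \mathbb{R}^n$, we define
\[ \|x\|_1 := \sum_{i=1}^{n} |x_i|, \qquad \|x\|_\infty := \max_{1 \le i \le n} |x_i|. \]

For a function $f : X \to \mathbb{R}$, $\mathrm{argmax}_{x \in X} f(x)$ is the set of all points in $X$ that maximize $f$:
\[ \mathrm{argmax}_{x \in X} f(x) := \Bigl\{ y \in X : f(y) = \max_{x \in X} f(x) \Bigr\}. \]

When the set $X$ is compact and the function $f$ is continuous, the set $\mathrm{argmax}_{x \in X} f(x)$ is non-empty.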

1 Markov Decision Problems

In this chapter, we introduce Markov decision problems, which are stochastic games with a single player. They serve as an appetizer. On the one hand, the basic concepts and basic proofs for zero-sum stochastic games are better understood in this simple model. On the other hand, some of the conclusions that we draw for Markov decision problems are different from those drawn for zero-sum stochastic games. This illustrates the inherent difference between single-player decision problems and multiplayer decision problems (= games). The interested reader is referred to, for example, Ross (1982) or Puterman (1994) for an exposition of Markov decision problems.

We will study both the $T$-stage evaluation and the discounted evaluation. We will introduce and study contracting mappings,¹ and will use such mappings to show that the decision maker has a stationary discounted optimal strategy. We will also define the concept of uniform optimality, and show that the decision maker has a stationary uniformly optimal strategy.

Definition 1.1 A Markov decision problem² is a vector $\Gamma = \langle S, (A(s))_{s \in S}, q, r \rangle$ where

• $S$ is a finite set of states.

• For each $s \in S$, $A(s)$ is a finite set of actions available at state $s$. The set of pairs (state, action) is denoted by
\[ SA := \{(s,a) : s \in S, a \in A(s)\}. \]

• $q : SA \to \Delta(S)$ is a transition rule.

• $r : SA \to \mathbb{R}$ is a payoff function.

¹ We adhere to the convention that a mapping is a function whose range is a general space or $\mathbb{R}^n$, while a function is always real-valued.

² Andrey Andreyevich Markov (Ryazan, Russia, June 14, 1856 – St. Petersburg, Russia, July 20, 1922) was a Russian mathematician. He is best known for his work on the theory of stochastic processes that now bear his name: Markov chains and Markov processes.

A Markov decision problem involves a decision maker, and it evolves as follows. The problem lasts for infinitely many stages. The initial state $s_1 \in S$ is given. At each stage $t \ge 1$, the following happens:

• The current state $s_t$ is announced to the decision maker.

• The decision maker chooses an action $a_t \in A(s_t)$ and receives the stage payoff $r(s_t, a_t)$.

• A new state $s_{t+1}$ is drawn according to $q(\cdot \mid s_t, a_t)$, and the game proceeds to stage $t + 1$.
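The three steps above can be sketched as a short simulation loop. This is only an illustration: the function name `simulate`, the dictionary encoding of $q$ and $r$, and the two-state toy problem are assumptions made here and do not come from the text.

```python
import random

def simulate(actions, q, r, strategy, s1, n_stages, rng):
    """Simulate n_stages stages of a Markov decision problem.

    actions[s] lists A(s); q[(s, a)] is a dict {next_state: probability};
    r[(s, a)] is the stage payoff; strategy(history) returns the action
    chosen after the history (s_1, a_1, ..., s_t), passed as a tuple.
    Returns the final history and the list of stage payoffs.
    """
    history, payoffs, s = (s1,), [], s1
    for _ in range(n_stages):
        a = strategy(history)                    # the decision maker chooses a_t
        if a not in actions[s]:
            raise ValueError(f"action {a!r} is not available at state {s!r}")
        payoffs.append(r[(s, a)])                # stage payoff r(s_t, a_t)
        next_states, probs = zip(*q[(s, a)].items())
        s = rng.choices(next_states, weights=probs)[0]  # s_{t+1} ~ q(. | s_t, a_t)
        history += (a, s)
    return history, payoffs

# Hypothetical toy problem (not from the text), with deterministic transitions.
toy_actions = {"x": ["stay", "go"], "y": ["stay"]}
toy_q = {("x", "stay"): {"x": 1.0}, ("x", "go"): {"y": 1.0}, ("y", "stay"): {"y": 1.0}}
toy_r = {("x", "stay"): 1.0, ("x", "go"): 3.0, ("y", "stay"): 0.0}

def go_at_stage_two(history):
    """Play 'go' at stage 2 if still in state x; otherwise play 'stay'."""
    return "go" if history[-1] == "x" and len(history) == 3 else "stay"

toy_history, toy_payoffs = simulate(
    toy_actions, toy_q, toy_r, go_at_stage_two, "x", 3, random.Random(0)
)
```

Passing the random generator explicitly keeps runs reproducible; the strategy receives the whole history, so history-dependent behavior can be simulated as well.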

Example 1.2 Consider the following situation. The technological level of a country can be High (H), Medium (M), or Low (L). The annual investment of the country in technological advances can also be high (2 billion dollars), medium (1 billion dollars), or low (0.5 billion dollars). The annual gain from technological level is increasing: the high, medium, and low technological level yield 10, 6, and 2 billion dollars, respectively. The technological level changes stochastically as a function of the investment in technological advancement, according to the following table:³

Technology level | High investment | Medium investment | Low investment
High | [1(H)] | [1/2(H), 1/2(M)] | [1/4(H), 3/4(M)]
Medium | [3/5(H), 2/5(M)] | [1(M)] | [2/5(M), 3/5(L)]
Low | [3/5(M), 2/5(L)] | [2/5(M), 3/5(L)] | [1(L)]

The situation can be presented as a Markov decision problem as follows:

• There are three states, which represent the three technological levels: $S = \{H, M, L\}$.

• There are three actions in each state, which represent the three investment levels: $A(s) = \{h, m, l\}$ for each $s \in S$.

• The transition rule is given by

q(H | H,h) = 1,    q(M | H,h) = 0,    q(L | H,h) = 0,
q(H | H,m) = 1/2,  q(M | H,m) = 1/2,  q(L | H,m) = 0,
q(H | H,l) = 1/4,  q(M | H,l) = 3/4,  q(L | H,l) = 0,
q(H | M,h) = 3/5,  q(M | M,h) = 2/5,  q(L | M,h) = 0,
q(H | M,m) = 0,    q(M | M,m) = 1,    q(L | M,m) = 0,
q(H | M,l) = 0,    q(M | M,l) = 2/5,  q(L | M,l) = 3/5,
q(H | L,h) = 0,    q(M | L,h) = 3/5,  q(L | L,h) = 2/5,
q(H | L,m) = 0,    q(M | L,m) = 2/5,  q(L | L,m) = 3/5,
q(H | L,l) = 0,    q(M | L,l) = 0,    q(L | L,l) = 1.

• The payoff function (in billions of dollars) is given by

r(H,h) = 8,  r(H,m) = 9,  r(H,l) = 9 1/2,
r(M,h) = 4,  r(M,m) = 5,  r(M,l) = 5 1/2,
r(L,h) = 0,  r(L,m) = 1,  r(L,l) = 1 1/2.

³ Here and in the sequel, a probability distribution is denoted by a list of probabilities and outcomes in square brackets, where the outcomes are written within round brackets. Thus, [2/3(H), 1/3(M)] means a probability distribution that assigns probability 2/3 to H and probability 1/3 to M.
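Example 1.2 can be written down as plain data and checked mechanically. The following is a minimal sketch; the names `STATES`, `ACTIONS`, `q`, `r`, and `is_valid_mdp` are assumptions introduced here, and exact fractions are used so that the check involves no rounding.

```python
from fractions import Fraction as F

# States (technological levels) and actions (investment levels) of Example 1.2.
STATES = ["H", "M", "L"]
ACTIONS = {s: ["h", "m", "l"] for s in STATES}

# Transition rule q(. | s, a), stored as {(s, a): {next_state: probability}};
# next states with probability 0 are omitted.
q = {
    ("H", "h"): {"H": F(1)},
    ("H", "m"): {"H": F(1, 2), "M": F(1, 2)},
    ("H", "l"): {"H": F(1, 4), "M": F(3, 4)},
    ("M", "h"): {"H": F(3, 5), "M": F(2, 5)},
    ("M", "m"): {"M": F(1)},
    ("M", "l"): {"M": F(2, 5), "L": F(3, 5)},
    ("L", "h"): {"M": F(3, 5), "L": F(2, 5)},
    ("L", "m"): {"M": F(2, 5), "L": F(3, 5)},
    ("L", "l"): {"L": F(1)},
}

# Stage payoff = annual gain of the level minus the annual investment.
GAIN = {"H": F(10), "M": F(6), "L": F(2)}
COST = {"h": F(2), "m": F(1), "l": F(1, 2)}
r = {(s, a): GAIN[s] - COST[a] for s in STATES for a in ACTIONS[s]}

def is_valid_mdp(states, actions, q, r):
    """Check that every q(. | s, a) is a probability distribution over the
    states and that a payoff is defined for every (state, action) pair."""
    for s in states:
        for a in actions[s]:
            dist = q.get((s, a))
            if dist is None or (s, a) not in r:
                return False
            if not set(dist) <= set(states):
                return False
            if any(p < 0 for p in dist.values()) or sum(dist.values()) != 1:
                return False
    return True
```

Deriving `r` from the gains and the investment costs reproduces the payoff function listed in the example, for instance $r(H,l) = 10 - 0.5 = 9\tfrac{1}{2}$.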

Example 1.3 The Markov decision problem that is illustrated in Figure 1.1 is formally defined as follows:

• There are three states: $S = \{s(1), s(2), s(3)\}$.

• In state $s(1)$, there are two actions: $A(s(1)) = \{U, D\}$; in states $s(2)$ and $s(3)$, there is one action: $A(s(2)) = A(s(3)) = \{D\}$.

• Payoffs appear at the center of each entry and are given by:
\[ r(s(1),U) = 10; \quad r(s(1),D) = 5; \quad r(s(2),D) = 10; \quad r(s(3),D) = -100. \]

• Transitions appear in parentheses next to the payoff and are given by:

– If in state $s(1)$ the decision maker chooses $U$, the process moves to state $s(2)$, that is, $q(s(2) \mid s(1),U) = 1$.

– If in state $s(1)$ the decision maker chooses $D$, the process remains in state $s(1)$, that is, $q(s(1) \mid s(1),D) = 1$.

– From state $s(2)$, the process moves to state $s(1)$ with probability 1/10 and to state $s(3)$ with probability 9/10, that is, $q(s(1) \mid s(2),D) = \frac{1}{10}$ and $q(s(3) \mid s(2),D) = \frac{9}{10}$.

– Once the process reaches state $s(3)$, it stays there, that is, $q(s(3) \mid s(3),D) = 1$.

[Figure 1.1: The Markov decision problem in Example 1.3. State s(1): U has entry 10 (0, 1, 0) and D has entry 5 (1, 0, 0); state s(2): D has entry 10 (1/10, 0, 9/10); state s(3): D has entry −100 (0, 0, 1).]

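1.1 On Histories

For $t \in \mathbb{N}$, the set of histories of length $t$ is defined by
\[ H_t := (SA)^{t-1} \times S, \]
where by convention $(SA)^0 = \emptyset$. This is the set of all histories that may occur until stage $t$. A typical element in $H_t$ is denoted by $h_t$. The last state of history $h_t$ is denoted by $s_t$. The set $H_1$ is identified with the state space $S$, and the history $(s_1)$ is simply denoted by $s_1$. We denote the set of all histories by
\[ H := \bigcup_{t \in \mathbb{N}} H_t, \]
and the set of all infinite histories or plays by
\[ H_\infty := (SA)^{\mathbb{N}}. \]

The set of plays $H_\infty$ is a measurable space, with the sigma-algebra generated by the cylinder sets, which are defined as follows. For a history $\hat h_t = (\hat s_1, \hat a_1, \ldots, \hat s_t) \in H_t$, the cylinder set $C(\hat h_t) \subset H_\infty$ is the collection of all plays that start with $\hat h_t$, that is,
\[ C(\hat h_t) := \{ h = (s_1, a_1, s_2, a_2, \ldots) \in H_\infty : s_1 = \hat s_1, a_1 = \hat a_1, \ldots, s_t = \hat s_t \}. \]

For every $t \in \mathbb{N}$, the collection of all cylinder sets $(C(h_t))_{h_t \in H_t}$ defines a finite partition, or an algebra, on $H_\infty$. We denote by $\mathcal{H}_t$ this algebra and by $\mathcal{H}$ the sigma-algebra on $H_\infty$ generated by the algebras $(\mathcal{H}_t)_{t \in \mathbb{N}}$.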

1.2 On Strategies

A mixed action at state $s$ is a probability distribution over the set of actions $A(s)$ available at state $s$. The set of mixed actions at state $s$ is therefore $\Delta(A(s))$. A strategy of the decision maker specifies how the decision maker should play after each possible history.

Definition 1.4 A strategy is a mapping $\sigma$ that assigns to each history $h = (s_1, a_1, \ldots, a_{t-1}, s_t)$ a mixed action in $\Delta(A(s_t))$.

The set of all strategies is denoted by $\Sigma$.

A decision maker who follows a strategy $\sigma$ behaves as follows: at each stage $t$, given the past history $(s_1, a_1, \ldots, s_t)$, the decision maker chooses an action $a_t$ according to the mixed action $\sigma(\cdot \mid s_1, a_1, \ldots, s_t)$.

Comment 1.5 A strategy as defined in Definition 1.4 is termed in the literature a behavior strategy.

Comment 1.6 The fact that the choice of the decision maker depends on past play implicitly assumes that the decision maker knows the past play; that is, the decision maker observes (and remembers) all past states that the process visited, and she remembers all her past choices. In Chapter 2, we will study the model of Markov decision problems when the decision maker does not observe the state.

Comment 1.7 A strategy contains a lot of irrelevant information. Indeed, when the initial state is $s_1 = s$, it is not important what the decision maker would play if the initial state were $s' \ne s$. Similarly, if in the first stage the decision maker played the action $a_1 = a$, it is irrelevant what she would play in the second stage if she played the action $a' \ne a$ in the first stage. We nevertheless regard a strategy as a mapping defined on the set of all histories, because of the simplicity of the definition; otherwise we would have to define for every strategy $\sigma$ and every positive integer $t$ the set of all histories of length $t$ that can occur with positive probability when the decision maker follows strategy $\sigma$ (which depends on the definition of $\sigma$ up to stage $t - 1$), and define $\sigma$ at stage $t$ only for those histories.

Every strategy $\sigma$, together with the initial state $s_1$, defines a probability distribution $P_{s_1,\sigma}$ on the measurable space $(H_\infty, \mathcal{H})$. To define this probability distribution formally, we define it on the collection of cylinder sets that generate $(H_\infty, \mathcal{H})$ by the rule
\[ P_{s_1,\sigma}\bigl(C(\hat s_1, \hat a_1, \ldots, \hat s_t)\bigr) := \mathbf{1}_{\{\hat s_1 = s_1\}} \cdot \prod_{k=1}^{t-1} \sigma(\hat a_k \mid \hat s_1, \hat a_1, \ldots, \hat s_k) \cdot \prod_{k=1}^{t-1} q(\hat s_{k+1} \mid \hat s_k, \hat a_k). \]

Let $P_{s_1,\sigma}$ be the unique probability distribution on $H_\infty$ that agrees with this definition on cylinder sets. The fact that, in this way, we indeed obtain a unique probability distribution is guaranteed by the Carathéodory⁴ Extension Theorem (see, e.g., theorem 3.1 in Billingsley (1995)).

⁴ Constantin Carathéodory (Berlin, Germany, September 13, 1873 – Munich, Germany, February 2, 1950) was a Greek mathematician who spent most of his career in Germany. He made significant contributions to the theory of functions of a real variable, the calculus of variations, and measure theory. His work also includes important results in conformal representations and in the theory of boundary correspondence.
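The probability of a cylinder set is a product of the probabilities of the chosen actions and of the realized transitions along the history. A minimal sketch of this computation follows; the function name `cylinder_probability`, the encoding of a strategy as a function from history prefixes to dictionaries, and the half-half mixing at state s(1) are assumptions made for illustration, while the transition data is that of Example 1.3.

```python
def cylinder_probability(history, s1, sigma, q):
    """Probability, under initial state s1 and strategy sigma, of the cylinder
    set of plays that start with history = (s_1, a_1, ..., s_t).

    sigma(prefix) returns a dict {action: probability}, the mixed action played
    after the history prefix; q[(s, a)] is a dict {next_state: probability}.
    """
    if len(history) % 2 == 0:
        raise ValueError("a history must end with a state")
    if history[0] != s1:
        return 0.0
    prob = 1.0
    for k in range(0, len(history) - 1, 2):
        prefix, a, s_next = history[: k + 1], history[k + 1], history[k + 2]
        prob *= sigma(prefix).get(a, 0.0)             # sigma(a_k | s_1, ..., s_k)
        prob *= q[(history[k], a)].get(s_next, 0.0)   # q(s_{k+1} | s_k, a_k)
    return prob

# Transition rule of Example 1.3, with states written "s1", "s2", "s3".
q_13 = {
    ("s1", "U"): {"s2": 1.0},
    ("s1", "D"): {"s1": 1.0},
    ("s2", "D"): {"s1": 0.1, "s3": 0.9},
    ("s3", "D"): {"s3": 1.0},
}

def half_half(prefix):
    """Hypothetical stationary strategy: mix U and D equally at s(1)."""
    return {"U": 0.5, "D": 0.5} if prefix[-1] == "s1" else {"D": 1.0}

p = cylinder_probability(("s1", "U", "s2", "D", "s3"), "s1", half_half, q_13)
```

Here `p` multiplies 1/2 (choosing U), 1 (moving to s(2)), 1 (choosing D), and 9/10 (moving to s(3)).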

Two simple classes of strategies are pure strategies that involve no randomization, and stationary strategies that depend only on the current state and not on the whole past history.

Definition 1.8 A strategy $\sigma$ is pure if $|\mathrm{supp}(\sigma(h_t))| = 1$ for every history $h_t \in H$.

The set of pure strategies is denoted by $\Sigma_P$.

Definition 1.9 A strategy $\sigma$ is stationary if, for every two histories $h_t = (s_1, a_1, s_2, \ldots, a_{t-1}, s_t)$ and $\hat h_k = (\hat s_1, \hat a_1, \hat s_2, \ldots, \hat a_{k-1}, \hat s_k)$ that satisfy $s_t = \hat s_k$, we have $\sigma(h_t) = \sigma(\hat h_k)$.

The set of stationary strategies is denoted by $\Sigma_S$.

A pure stationary strategy assigns to each state $s \in S$ an action in $A(s)$. Since the number of actions in $A(s)$ is $|A(s)|$, we can express the number of pure stationary strategies in terms of the data of the Markov decision problem.

Theorem 1.10 The number of pure stationary strategies is $\prod_{s \in S} |A(s)|$.
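The count in Theorem 1.10 can be checked by enumerating the pure stationary strategies directly. The sketch below is an illustration only; the function names and the dictionary encoding of the action sets are assumptions, and the action sets are those of Examples 1.2 and 1.3.

```python
from itertools import product
from math import prod

def pure_stationary_strategies(actions):
    """Yield every pure stationary strategy as a dict {state: action},
    where `actions` maps each state s to the list A(s)."""
    states = list(actions)
    for choice in product(*(actions[s] for s in states)):
        yield dict(zip(states, choice))

def count_pure_stationary(actions):
    """The product over all states s of |A(s)|."""
    return prod(len(available) for available in actions.values())

# Action sets of Example 1.2 and Example 1.3.
actions_1_2 = {"H": ["h", "m", "l"], "M": ["h", "m", "l"], "L": ["h", "m", "l"]}
actions_1_3 = {"s1": ["U", "D"], "s2": ["D"], "s3": ["D"]}

n_1_2 = len(list(pure_stationary_strategies(actions_1_2)))
n_1_3 = len(list(pure_stationary_strategies(actions_1_3)))
```

For Example 1.2 the enumeration produces $3^3 = 27$ strategies, and for Example 1.3 it produces $2 \cdot 1 \cdot 1 = 2$.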

One can identify a stationary strategy $\sigma$ with a vector $x \in \prod_{s \in S} \Delta(A(s))$. With this identification, $x(s)$ is the mixed action chosen when the current state is $s$. Thus, the set of stationary strategies $\Sigma_S$ can be identified with the space $X := \prod_{s \in S} \Delta(A(s))$, which is convex and compact. For every element $x \in X$, the stationary strategy that corresponds to $x$ is still denoted by $x$.

In Definition 1.4 we defined a strategy to be a mapping from histories to mixed actions. We now present another concept of a strategy that involves randomization: a mixed strategy.

Definition 1.11 A mixed strategy is a probability distribution over the set $\Sigma_P$ of pure strategies.

Every strategy is equivalent to a mixed strategy. Indeed, a strategy $\sigma$ is defined by $\aleph_0$ lotteries: to each history $h_t \in H$, it assigns a lottery $\sigma(h_t) \in \Delta(A(s_t))$. If the decision maker performs all the $\aleph_0$ lotteries before the play starts, then the realizations of the lotteries define a pure strategy. In particular, the strategy defines a probability distribution over the set of pure strategies.

Conversely, every mixed strategy is equivalent to a strategy. Indeed, given a mixed strategy $\tau$, one can calculate for each history $h_t$ the conditional probability $\sigma(a_t \mid h_t)$ that the action chosen after $h_t$ is $a_t \in A(s_t)$. If the history $h_t$ occurs with probability 0 under $P_{s_1,\sigma}$, we set $\sigma(a_t \mid h_t)$ arbitrarily. One can show that the strategy $\sigma$ is equivalent to the mixed strategy $\tau$.

The equivalence just described is a special case of a more general result called Kuhn's Theorem;⁵ see, for example, Maschler, Solan, and Zamir (2020, chapter 7).

1.3 The T-Stage Payoff

The decision maker receives the stage payoff $r(s_t, a_t)$ at every stage $t$. How does she compare sequences of stage payoffs? We will study two methods of evaluations. The first, which we consider in this section, is the $T$-stage evaluation. This evaluation is relevant when the process lasts $T$ stages, and the goal of the decision maker is to maximize her expected average payoff during these stages. The second, which we will study in the next section, is the discounted evaluation, which is relevant when the play continues indefinitely, and the goal of the decision maker is to maximize the expected discounted sum of her stage payoffs.

The expectation operator for the probability distribution $P_{s_1,\sigma}$ is denoted by $E_{s_1,\sigma}[\,\cdot\,]$. In particular, $E_{s_1,\sigma}[r(s_t, a_t)]$ is the expected payoff at stage $t$.

Definition 1.12 For every positive integer $T \in \mathbb{N}$, every initial state $s_1 \in S$, and every strategy $\sigma \in \Sigma$, define the $T$-stage payoff by:
\[ \gamma_T(s_1; \sigma) := E_{s_1,\sigma}\Bigl[ \frac{1}{T} \sum_{t=1}^{T} r(s_t, a_t) \Bigr]. \]

Example 1.13 The Markov decision problem in this example is given in Figure 1.2.

The initial state is $s(1)$. We will calculate the $T$-stage payoff of every pure strategy.

[Figure 1.2: The Markov decision problem in Example 1.13.]

⁵ Harold William Kuhn (Santa Monica, California, July 29, 1925 – New York City, New York, July 2, 2014) was an American mathematician. He is known for the Karush–Kuhn–Tucker conditions, for Kuhn's theorem, and for developing Kuhn poker as well as the description of the Hungarian method for the assignment problem.

The strategy $\sigma_D$ that always plays $D$ yields a payoff 5 at every stage, and therefore its $T$-stage payoff is 5 as well:
\[ \gamma_T(s(1); \sigma_D) = 5, \quad \forall T \in \mathbb{N}. \]

The strategy $\sigma_U$ that plays $U$ in the first stage yields 10 in the first stage and 2 in all subsequent stages. Therefore,
\[ \gamma_T(s(1); \sigma_U) = \frac{10 + 2(T-1)}{T} = \frac{2T + 8}{T}. \]

For every $0 \le t < T$, the strategy $\sigma_{D^t U}$ that plays $D$ in the first $t$ stages and $U$ in stage $t + 1$ yields 5 in the first $t$ stages, 10 in stage $t + 1$, and 2 in all subsequent stages. Therefore,
\[ \gamma_T(s(1); \sigma_{D^t U}) = \frac{5t + 10 + 2(T - t - 1)}{T} = \frac{2T + 3t + 8}{T}. \]

Definition 1.14 Let $s \in S$ and let $T \in \mathbb{N}$. The real number $v_T(s)$ is the $T$-stage value at the initial state $s$ if
\[ v_T(s) := \sup_{\sigma \in \Sigma} \gamma_T(s; \sigma). \tag{1.3} \]

Any strategy in $\mathrm{argmax}_{\sigma \in \Sigma} \gamma_T(s; \sigma)$ is $T$-stage optimal at $s$.

In other words, the $T$-stage value at $s$ is the maximal amount that the decision maker can get when the initial state is $s$, and a strategy that guarantees this quantity is $T$-stage optimal.

Is the supremum in Eq. (1.3) attained? That is, is there a $T$-stage optimal strategy? As Theorem 1.15 states, the answer is positive.

Theorem 1.15 For every $s \in S$ and every $T \ge 1$, there is a $T$-stage optimal strategy at the initial state $s$.

Proof In the $T$-stage game, the only relevant part of the strategy is its play up to stage $T$. In particular, for the purpose of studying the $T$-stage problem, we can define a strategy as a mapping $\sigma : \bigcup_{t=1}^{T} H_t \to \bigcup_{s \in S} \Delta(A(s))$, such that $\sigma(h_t) \in \Delta(A(s_t))$, for every history $h_t \in \bigcup_{t=1}^{T} H_t$. The set of all such mappings is a compact subset of a Euclidean space. The payoff function is continuous on this set. Since a continuous function defined on a compact set attains its maximum, the result follows.

Comment 1.16 We can strengthen Theorem 1.15 and prove that, for every $s \in S$ and every $T \ge 1$, there is a $T$-stage pure optimal strategy at the initial state $s$ (see Theorem 1.18). To see this, consider the function that maps each mixed strategy $\sigma$ into the $T$-stage payoff $\gamma_T(s; \sigma)$. This function is linear. Indeed, let $\sigma_1$ and $\sigma_2$ be two strategies, and let $\sigma_3$ be the following strategy: toss a fair coin; if the result is Head, follow $\sigma_1$, whereas if it is Tail, follow $\sigma_2$. Then
\[ \gamma_T(s; \sigma_3) = \tfrac{1}{2} \gamma_T(s; \sigma_1) + \tfrac{1}{2} \gamma_T(s; \sigma_2). \]

By the Krein–Milman⁶ Theorem, a linear function that is defined on a compact space attains its maximum at an extreme point. Since the pure strategies are the extreme points of the set of mixed strategies, it follows that the function $\sigma \mapsto \gamma_T(s; \sigma)$ attains its maximum at a pure strategy.

Example 1.13, continued The quantity $\gamma_T(s(1); \sigma_{D^t U}) = \frac{2T + 3t + 8}{T}$ is maximized when $t = T - 1$: the decision maker plays $T - 1$ times $D$, and then she plays $U$ once. The resulting average payoff is $5 + \frac{5}{T}$. The $T$-stage value at the initial state $s(1)$ is therefore $v_T(s(1)) = 5 + \frac{5}{T}$.
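The maximization over the switching time $t$ can be checked numerically from the payoff streams described in Example 1.13 (5 while playing $D$, 10 at the stage in which $U$ is played, and 2 afterwards). The sketch below works only with these streams, since the figure of the example is not reproduced here; the function names are assumptions.

```python
from fractions import Fraction

def payoff_d_then_u(t, T):
    """T-stage average payoff of playing D in the first t stages and U at
    stage t + 1, for 0 <= t < T: the stream is 5 (t times), 10, then 2."""
    if not 0 <= t < T:
        raise ValueError("need 0 <= t < T")
    stream = [5] * t + [10] + [2] * (T - t - 1)
    return Fraction(sum(stream), T)

def best_switch_time(T):
    """The t in {0, ..., T - 1} that maximizes the T-stage average payoff."""
    return max(range(T), key=lambda t: payoff_d_then_u(t, T))

T = 10
t_star = best_switch_time(T)
best_payoff = payoff_d_then_u(t_star, T)
```

With $T = 10$, the best switching time is $t = 9$ and the payoff is $5 + \frac{5}{10}$, which exceeds the payoff 5 of always playing $D$.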

In general, the $T$-stage value, as well as the $T$-stage optimal strategies, can be found by backward induction, a method that is also known as the dynamic programming principle. We now formalize this method.

Theorem 1.17 For every initial state $s_1 \in S$ and every $T \ge 2$, we have
\[ v_T(s_1) = \max_{a_1 \in A(s_1)} \Bigl( \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1) v_{T-1}(s_2) \Bigr). \tag{1.4} \]

Eq. (1.4) states that, to calculate the $T$-stage value, we can break the problem into two parts: the first stage, and the last $T - 1$ stages. Since transitions and payoffs depend only on the current state and on the current action, the problem that starts at stage 2 is not affected by $s_1$ and $a_1$, the state and action at stage 1. This problem is a $(T-1)$-stage Markov decision problem, whose value $v_{T-1}(s_2)$ depends on its initial state (and not on the initial state $s_1$). To calculate the $T$-stage value, we collapse the last $T - 1$ stages into a single number, the value of the $(T-1)$-stage problem that starts at stage 2, and we ask what is the optimal action in the first stage, assuming that if state $s_2$ is reached at stage 2, the continuation value is $v_{T-1}(s_2)$.

⁶ Mark Grigorievich Krein (Kiev, Russia, April 3, 1907 – Odessa, Ukraine, October 17, 1989) was a Soviet mathematician who is best known for his work in operator theory.

David Pinhusovich Milman (Kiev, Russia, January 15, 1912 – Tel Aviv, Israel, July 12, 1982) was a Soviet and later Israeli mathematician specializing in functional analysis.

In Eq. (1.4) the weight of the payoff in the first stage, $r(s_1, a_1)$, is $\frac{1}{T}$, and the weight of the value of the $(T-1)$-stage problem that encapsulates the last $T - 1$ stages is $\frac{T-1}{T}$. Why do we take these weights? The reason is that the quantity $r(s_1, a_1)$ represents the payoff in the first stage, while the quantity $v_{T-1}(s_2)$ captures the average payoff in $T - 1$ stages: stages $2, 3, \ldots, T$. The weights of each of the two quantities reflect this point.

To prove Theorem 1.17, we will consider conditional expectation. Recall that $E_{s_1,\sigma}[r(s_t, a_t)]$ is the expected payoff at stage $t$. For every $t' \le t$ and every history $\hat h_{t'} = (\hat s_1, \hat a_1, \ldots, \hat s_{t'}) \in H_{t'}$ with $\hat s_1 = s_1$, the quantity $E_{s_1,\sigma}[r(s_t, a_t) \mid \hat h_{t'}]$ is the expected payoff at stage $t$, conditional that the history $\hat h_{t'}$ has occurred, that is, conditional that the action in the initial state is $\hat a_1$, the state at stage 2 is $\hat s_2$, and so on. Formally, for every history $\hat h_{t'} = (\hat s_1, \hat a_1, \ldots, \hat s_{t'}) \in H_{t'}$, the probability distribution $P_{s_1,\sigma}(\cdot \mid \hat h_{t'})$ is defined as follows:

• For histories that are not longer than $\hat h_{t'}$: For every $t \le t'$ we have
\[ P_{s_1,\sigma}(C(s_1, a_1, \ldots, s_t) \mid \hat h_{t'}) := \mathbf{1}_{\{s_1 = \hat s_1, a_1 = \hat a_1, \ldots, s_t = \hat s_t\}}. \]

• For histories that are longer than $\hat h_{t'}$: For every $t > t'$, we have
\[ P_{s_1,\sigma}(C(s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t) \mid \hat h_{t'}) := \mathbf{1}_{\{s_1 = \hat s_1, a_1 = \hat a_1, \ldots, s_{t'} = \hat s_{t'}\}} \prod_{k=t'}^{t-1} \sigma(a_k \mid s_1, a_1, \ldots, s_k) \times \prod_{k=t'}^{t-1} q(s_{k+1} \mid s_k, a_k). \]

Denote by $E_{s_1,\sigma}[\,\cdot \mid \hat h_{t'}]$ the expectation with respect to $P_{s_1,\sigma}(\cdot \mid \hat h_{t'})$.

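Proof of Theorem 1.17 For $T = 1$, the $T$-stage problem concerns the first stage only, and
\[ v_1(s_1) = \max_{a_1 \in A(s_1)} r(s_1, a_1). \]

In particular, Eq. (1.4) holds. For $T \ge 2$, by definition and by the law of iterated expectations,
\[ v_T(s_1) = \max_{\sigma \in \Sigma} E_{s_1,\sigma}\Bigl[ \frac{1}{T} \sum_{t=1}^{T} r(s_t, a_t) \Bigr] = \max_{\sigma \in \Sigma} E_{s_1,\sigma}\Bigl[ \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} E_{s_1,\sigma}\Bigl[ \frac{1}{T-1} \sum_{t=2}^{T} r(s_t, a_t) \Bigm| s_1, a_1, s_2 \Bigr] \Bigr]. \tag{1.5} \]

The term within the maximization in these equalities depends only on the part of the strategy $\sigma$ that follows the initial state $s_1$. This part is composed of the mixed action $\sigma(s_1) \in \Delta(A(s_1))$ that is played in the first stage and the continuation strategies played from the second stage and on. We denote these continuation strategies by $(\sigma_{s_1,a_1})_{a_1 \in A(s_1)}$. Formally, for every action $a_1 \in A(s_1)$, $\sigma_{s_1,a_1}$ is a strategy in the $(T-1)$-stage problem that is defined by
\[ \sigma_{s_1,a_1}(h_t) := \sigma(s_1, a_1, h_t), \quad \forall h_t \in H. \]

With this notation, the right-hand side in Eq. (1.5) is equal to
\[ \max_{\alpha \in \Delta(A(s_1))} \; \max_{(\sigma_{s_1,a_1})_{a_1 \in A(s_1)}} \; \sum_{a_1 \in A(s_1)} \alpha(a_1) \Bigl( \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1) \gamma_{T-1}(s_2; \sigma_{s_1,a_1}) \Bigr), \tag{1.6} \]
where $\alpha$ captures the mixed action played in the first stage. The continuation strategies $(\sigma_{s_1,a_1})_{a_1 \in A(s_1)}$ do not affect the payoff in the first stage $r(s_1, a_1)$. The action $a_1$ that is chosen in the first stage affects the continuation payoff in two ways. First, it determines the probability $q(s_2 \mid s_1, a_1)$ that the state in the second stage is $s_2$. Second, it determines the continuation strategy $\sigma_{s_1,a_1}$. Since the probability distribution $P_{s_1,\sigma}$ conditional on $a_1$ and $s_2$ is equal to the probability distribution $P_{s_2,\sigma_{s_1,a_1}}$, it follows that we can split the maximization problem in Eq. (1.6) into two parts, and obtain that
\[ v_T(s_1) = \max_{\alpha \in \Delta(A(s_1))} \sum_{a_1 \in A(s_1)} \alpha(a_1) \Bigl( \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \max_{\sigma_{s_1,a_1}} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1) \gamma_{T-1}(s_2; \sigma_{s_1,a_1}) \Bigr). \tag{1.7} \]

Note that
\[ \max_{\sigma_{s_1,a_1}} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1) \gamma_{T-1}(s_2; \sigma_{s_1,a_1}) = \sum_{s_2 \in S} q(s_2 \mid s_1, a_1) v_{T-1}(s_2); \]
hence, the right-hand side of Eq. (1.7) is equal to
\[ \max_{\alpha \in \Delta(A(s_1))} \Bigl( \sum_{a_1 \in A(s_1)} \alpha(a_1) \Bigl( \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1) v_{T-1}(s_2) \Bigr) \Bigr). \]

The function within the parentheses is linear in $\alpha$, and $\Delta(A(s_1))$ is a compact set whose extreme points are the Dirac measures concentrated at the points $a_1$ with $a_1 \in A(s_1)$. A linear function that is defined on a compact set attains its maximum in an extreme point. The result follows.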

The proof of Theorem 1.17 yields an algorithm that calculates the $T$-stage value and a $T$-stage optimal strategy $\sigma^*$. We will calculate by induction a $k$-stage optimal strategy $\sigma^*_k$ for every $k = 1, 2, \ldots, T$. We start with $k = 1$, and calculate a one-stage optimal strategy for every initial state $s \in S$. Let $a^*_1(s) \in A(s)$ be an action that maximizes the quantity $r(s, a)$ over $a \in A(s)$, and set
\[ \sigma^*_1(s) := a^*_1(s). \]

The value of the one-stage problem with initial state $s$ is $v_1(s) = r(s, a^*_1(s))$.

We continue recursively. Suppose that, for every initial state $s$, we already calculated $v_{k-1}(s)$ and already defined a $(k-1)$-stage optimal strategy $\sigma^*_{k-1}$. To calculate $v_k(s)$ and define a $k$-stage optimal strategy $\sigma^*_k$, we take
\[ \max_{a \in A(s)} \Bigl( \frac{1}{k} r(s, a) + \frac{k-1}{k} \sum_{s' \in S} q(s' \mid s, a) v_{k-1}(s') \Bigr), \tag{1.8} \]
and denote by $a^*_k(s) \in A(s)$ an action that achieves the maximum in Eq. (1.8). This is the quantity on the right-hand side of Eq. (1.4); hence, it is equal to $v_k(s)$. We can now define an optimal strategy $\sigma^*$ for the decision maker in the $T$-stage problem as follows:

• At stage 1, play the action $a^*_T(s_1)$.

• From stage 2 on, follow the strategy $\sigma^*_{T-1}$; that is, at each stage $t$, when the current state is $s_t$ and $T - t + 1$ stages are left, play the action $a^*_{T-t+1}(s_t)$.

Formally,
\[ \sigma^*(h_t) := a^*_{T-t+1}(s_t), \quad \forall h_t \in H_t, \; 1 \le t \le T. \]

In Exercise 1.1, the reader is asked to prove that this strategy is indeed $T$-stage optimal.
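The recursion can be sketched in a few lines of code. The following is an illustrative implementation under stated assumptions, not code from the text: the function name `backward_induction`, the dictionary encoding, and the state labels "s1", "s2", "s3" are choices made here, and the data is that of Example 1.3. Exact fractions avoid rounding.

```python
from fractions import Fraction

def backward_induction(states, actions, q, r, T):
    """Return (v, policy), where v[k][s] is the k-stage value at state s and
    policy[k][s] is an action attaining the maximum in the recursion
        v_k(s) = max_a ( r(s,a)/k + (k-1)/k * sum_{s'} q(s'|s,a) v_{k-1}(s') ),
    for k = 1, ..., T. v[0] is a placeholder that receives weight 0 when k = 1."""
    v = {0: {s: Fraction(0) for s in states}}
    policy = {}
    for k in range(1, T + 1):
        v[k], policy[k] = {}, {}
        for s in states:
            def score(a):
                continuation = sum(p * v[k - 1][s2] for s2, p in q[(s, a)].items())
                return Fraction(1, k) * r[(s, a)] + Fraction(k - 1, k) * continuation
            best = max(actions[s], key=score)
            v[k][s], policy[k][s] = score(best), best
    return v, policy

# Data of Example 1.3.
STATES = ["s1", "s2", "s3"]
ACTIONS = {"s1": ["U", "D"], "s2": ["D"], "s3": ["D"]}
q = {
    ("s1", "U"): {"s2": Fraction(1)},
    ("s1", "D"): {"s1": Fraction(1)},
    ("s2", "D"): {"s1": Fraction(1, 10), "s3": Fraction(9, 10)},
    ("s3", "D"): {"s3": Fraction(1)},
}
r = {("s1", "U"): 10, ("s1", "D"): 5, ("s2", "D"): 10, ("s3", "D"): -100}

v, policy = backward_induction(STATES, ACTIONS, q, r, 3)
```

In this encoding, an optimal strategy for the $T$-stage problem plays `policy[T - t + 1][s_t]` at stage $t$. For the data above, the three-stage problem starting at s(1) is solved by playing $D$ first, then $U$, then $D$ at s(2), for an average payoff of $25/3$.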

The proof of Theorem 1.17 relies on the linearity of the payoff function: the goal of the decision maker is to maximize a linear function of the stage payoffs. If the sets of actions and states are not finite, the theorem still holds, provided that in Eq. (1.4) we replace maximum by supremum.

Theorem 1.17 admits the following corollary.

Theorem 1.18 The $T$-stage value always exists. Moreover, there exists an optimal pure strategy $\sigma \in \Sigma$.

One can show a stronger result concerning the structure of an optimal pure strategy: there exists an optimal pure strategy $\sigma$ with the property that $\sigma(h_t)$ depends on the current state $s_t$ and on the stage $t$, and is independent of the rest of the history $(s_1, a_1, \ldots, s_{t-1}, a_{t-1})$ (Exercise 1.3).

1.4 The Discounted Payoff

The discounted payoff depends on a parameter $\lambda \in (0, 1]$, called the discount factor, which measures how money grows with time: one dollar today is worth $\frac{1}{1-\lambda}$ dollars tomorrow, $\frac{1}{(1-\lambda)^2}$ dollars the day after tomorrow, and so on. In other words, the decision maker is indifferent between getting $1 - \lambda$ dollars today and one dollar tomorrow.

Definition 1.19 For every discount factor $\lambda \in (0, 1]$, every state $s \in S$, and every strategy $\sigma \in \Sigma$, the $\lambda$-discounted payoff under strategy $\sigma$ at the initial state $s$ is
\[ \gamma_\lambda(s; \sigma) := E_{s,\sigma}\Bigl[ \sum_{t=1}^{\infty} \lambda (1-\lambda)^{t-1} r(s_t, a_t) \Bigr]. \tag{1.9} \]

The $\lambda$ in Eq. (1.9) serves as a normalization factor: a player who receives one dollar at every stage evaluates this stream of payoffs as one dollar. Since there are finitely many states and actions, the payoff function $r$ is bounded, and therefore $\gamma_\lambda$ obeys the same bound (which is independent of $\lambda$, thanks to the multiplication by $\lambda$).
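The normalization can be illustrated numerically: the weights $\lambda(1-\lambda)^{t-1}$ form a geometric series that sums to 1, so the constant stream of one dollar per stage is worth one dollar. The helper below is a small sketch written for this purpose; its name and the truncation to finitely many stages are assumptions.

```python
def discounted_sum(payoffs, lam):
    """Sum of lam * (1 - lam)**(t - 1) * payoffs[t - 1] over t = 1, ..., len(payoffs):
    the discounted evaluation of a finite prefix of a payoff stream."""
    if not 0 < lam <= 1:
        raise ValueError("the discount factor must lie in (0, 1]")
    return sum(lam * (1 - lam) ** t * x for t, x in enumerate(payoffs))

# One dollar at every stage, truncated after n stages: the sum is 1 - (1 - lam)**n.
three_stages = discounted_sum([1, 1, 1], 0.5)
many_stages = discounted_sum([1] * 2000, 0.1)
```

After three stages with $\lambda = 1/2$ the partial sum is $1 - (1/2)^3$, and with many stages it approaches 1.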

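The dominated convergence theorem (see, e.g., Shiryaev (1995), theorem 6.3) implies that
\[ \gamma_\lambda(s_1; \sigma) = \sum_{t=1}^{\infty} \lambda (1-\lambda)^{t-1} E_{s_1,\sigma}[r(s_t, a_t)]. \]

Simple algebraic manipulations yield
\[ \gamma_\lambda(s_1; \sigma) = \lambda E_{s_1,\sigma}[r(s_1, a_1)] + (1-\lambda) \sum_{t=2}^{\infty} \lambda (1-\lambda)^{t-2} E_{s_1,\sigma}[r(s_t, a_t)]. \tag{1.10} \]

For every two states $s, s' \in S$ and every action $a \in A(s)$, set
\[ \gamma_\lambda(s'; \sigma_{s,a}) := E_{s,\sigma}\Bigl[ \sum_{t=2}^{\infty} \lambda (1-\lambda)^{t-2} r(s_t, a_t) \Bigm| s_1 = s, a_1 = a, s_2 = s' \Bigr]. \]

This is the expected discounted payoff from stage 2 on, when conditioning on the history at stage 2. Alternatively, this is the expected discounted payoff when the initial state is $s'$, and the decision maker follows that part of her strategy that follows the history $(s, a)$. If $\sigma$ is a stationary strategy, then the way it plays after the first stage does not depend on the play in the first stage. Hence, in this case, for every two states $s, s' \in S$ and every action $a \in A(s)$ we have $\gamma_\lambda(s'; \sigma_{s,a}) = \gamma_\lambda(s'; \sigma)$.

From Eq. (1.10) we obtain:
\[ \gamma_\lambda(s_1; \sigma) = E_{s_1,\sigma}\bigl[ \lambda r(s_1, a_1) + (1-\lambda) \gamma_\lambda(s_2; \sigma_{s_1,a_1}) \bigr]. \tag{1.11} \]

Thus, the expected payoff is a weighted average of the payoff $r(s_1, a_1)$ at the first stage and the expected payoff $\gamma_\lambda(s_2; \sigma_{s_1,a_1})$ in all subsequent stages. When the discount factor $\lambda$ is high, the weight of the first stage is high; whereas when the discount factor $\lambda$ is low, the weight of the first stage is low.

Eq. (1.11) illustrates that the decision maker's payoff consists of two parts: today's payoff and the future's payoff. The discount factor indicates the relative importance of each part. The lower the discount factor, the higher the importance of the future, and therefore the decision maker should put more weight on future opportunities. The higher the discount factor, the higher the importance of the present, and the decision maker should concentrate on short-term gains.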
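For a pure stationary strategy $x$, the split into today's payoff and the future's payoff gives the relation $w(s) = \lambda r(s, x(s)) + (1-\lambda) \sum_{s'} q(s' \mid s, x(s)) w(s')$ for the discounted payoff $w(s)$ at each initial state $s$. One possible way to compute $w$ numerically is to iterate this relation until it stabilizes; this iteration is a technique chosen here for illustration and is not presented as the text's method. The function name, the tolerance, and the state labels are assumptions; the data is that of Example 1.3.

```python
def evaluate_stationary(states, q, r, x, lam, tol=1e-12):
    """Approximate the lam-discounted payoff of the pure stationary strategy x
    (a dict {state: action}) at every initial state, by repeatedly applying
        w(s) <- lam * r(s, x(s)) + (1 - lam) * sum_{s'} q(s' | s, x(s)) * w(s')
    until two consecutive iterates differ by less than tol."""
    if not 0 < lam <= 1:
        raise ValueError("the discount factor must lie in (0, 1]")
    w = {s: 0.0 for s in states}
    while True:
        new_w = {}
        for s in states:
            a = x[s]
            continuation = sum(p * w[s2] for s2, p in q[(s, a)].items())
            new_w[s] = lam * r[(s, a)] + (1 - lam) * continuation
        if max(abs(new_w[s] - w[s]) for s in states) < tol:
            return new_w
        w = new_w

# Data of Example 1.3.
STATES = ["s1", "s2", "s3"]
q = {
    ("s1", "U"): {"s2": 1.0},
    ("s1", "D"): {"s1": 1.0},
    ("s2", "D"): {"s1": 0.1, "s3": 0.9},
    ("s3", "D"): {"s3": 1.0},
}
r = {("s1", "U"): 10.0, ("s1", "D"): 5.0, ("s2", "D"): 10.0, ("s3", "D"): -100.0}

always_D = {"s1": "D", "s2": "D", "s3": "D"}
always_U = {"s1": "U", "s2": "D", "s3": "D"}

w_D = evaluate_stationary(STATES, q, r, always_D, 0.5)
w_U = evaluate_stationary(STATES, q, r, always_U, 0.5)
```

Each iteration shrinks the error by a factor of $1 - \lambda$, so the loop terminates quickly for large $\lambda$ and slowly for $\lambda$ close to 0. With $\lambda = 1/2$, always playing $D$ at s(1) is worth 5, while always playing $U$ is worth about $-15.38$ because of the risk of reaching s(3).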

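Comment 1.20 In the proof of Theorem 1.17 we in fact showed that the $T$-stage payoff satisfies the following formula:
\[ \gamma_T(s_1; \sigma) = E_{s_1,\sigma}\Bigl[ \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \gamma_{T-1}(s_2; \sigma_{s_1,a_1}) \Bigr]. \tag{1.12} \]

Thus, similar to the discounted payoff, the $T$-stage payoff is a weighted average of the payoff $r(s_1, a_1)$ at the first stage and the expected payoff $\gamma_{T-1}(s_2; \sigma_{s_1,a_1})$ in all subsequent stages, with weights $\frac{1}{T}$ and $\frac{T-1}{T}$.

Example 1.3, continued The Markov decision problem in Example 1.3 is reproduced in Figure 1.3.

[Figure 1.3: The Markov decision problem in Example 1.3.]

The initial state is $s(1)$. The strategy $\sigma_D$ that always plays $D$ at state $s(1)$ yields a payoff 5 at every stage, and therefore its $\lambda$-discounted payoff is 5 as well. Let us calculate the $\lambda$-discounted payoff of the strategy $\sigma_U$ that always plays $U$ at state $s(1)$. Since this strategy is stationary,
\[ \gamma_\lambda(s(1); \sigma_U) = 10\lambda + (1-\lambda) \Bigl( 10\lambda + (1-\lambda) \Bigl( \frac{1}{10} \gamma_\lambda(s(1); \sigma_U) + \frac{9}{10} \cdot (-100) \Bigr) \Bigr). \tag{1.13} \]

The term $\gamma_\lambda(s(1); \sigma_U)$ on the right-hand side is the discounted payoff from the third stage and on, if at the second stage the play moves from $s(2)$ to $s(1)$. Eq. (1.13) solves to
\[ \gamma_\lambda(s(1); \sigma_U) = \frac{100\lambda(2-\lambda) - 900(1-\lambda)^2}{10 - (1-\lambda)^2}. \]

For $\lambda = 1$ (only the first day matters), we get $\gamma_1(s(1); \sigma_U) = 10$, while for $\lambda$ close to 0 (the far future matters), we get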