A Course in Stochastic Game Theory
Eilon Solan
Managing Editor: Ian J. Leary,
Mathematical Sciences, University of Southampton, UK
63 Singular points of plane curves, C. T. C. WALL
64 A short course on Banach space theory, N. L. CAROTHERS
65 Elements of the representation theory of associative algebras I, IBRAHIM ASSEM, DANIEL SIMSON & ANDRZEJ SKOWROŃSKI
66 An introduction to sieve methods and their applications, ALINA CARMEN COJOCARU & M. RAM MURTY
67 Elliptic functions, J. V. ARMITAGE & W. F. EBERLEIN
68 Hyperbolic geometry from a local viewpoint, LINDA KEEN & NIKOLA LAKIC
69 Lectures on Kähler geometry, ANDREI MOROIANU
70 Dependence logic, JOUKO VÄÄNÄNEN
71 Elements of the representation theory of associative algebras II, DANIEL SIMSON & ANDRZEJ SKOWROŃSKI
72 Elements of the representation theory of associative algebras III, DANIEL SIMSON & ANDRZEJ SKOWROŃSKI
73 Groups, graphs and trees, JOHN MEIER
74 Representation theorems in Hardy spaces, JAVAD MASHREGHI
75 An introduction to the theory of graph spectra, DRAGOŠ CVETKOVIĆ, PETER ROWLINSON & SLOBODAN SIMIĆ
76 Number theory in the spirit of Liouville, KENNETH S. WILLIAMS
77 Lectures on profinite topics in group theory, BENJAMIN KLOPSCH, NIKOLAY NIKOLOV & CHRISTOPHER VOLL
78 Clifford algebras: An introduction, D. J. H. GARLING
79 Introduction to compact Riemann surfaces and dessins d'enfants, ERNESTO GIRONDO & GABINO GONZÁLEZ-DIEZ
80 The Riemann hypothesis for function fields, MACHIEL VAN FRANKENHUIJSEN
81 Number theory, Fourier analysis and geometric discrepancy, GIANCARLO TRAVAGLINI
82 Finite geometry and combinatorial applications, SIMEON BALL
83 The geometry of celestial mechanics, HANSJÖRG GEIGES
84 Random graphs, geometry and asymptotic structure, MICHAEL KRIVELEVICH et al.
85 Fourier analysis: Part I – Theory, ADRIAN CONSTANTIN
86 Dispersive partial differential equations, M. BURAK ERDOĞAN & NIKOLAOS TZIRAKIS
87 Riemann surfaces and algebraic curves, R. CAVALIERI & E. MILES
88 Groups, languages and automata, DEREK F. HOLT, SARAH REES & CLAAS E. RÖVER
89 Analysis on Polish spaces and an introduction to optimal transportation, D. J. H. GARLING
90 The homotopy theory of (∞, 1)-categories, JULIA E. BERGNER
91 The block theory of finite group algebras I, MARKUS LINCKELMANN
92 The block theory of finite group algebras II, MARKUS LINCKELMANN
93 Semigroups of linear operators, DAVID APPLEBAUM
94 Introduction to approximate groups, MATTHEW C. H. TOINTON
95 Representations of finite groups of Lie type (2nd Edition), FRANÇOIS DIGNE & JEAN MICHEL
96 Tensor products of C*-algebras and operator spaces, GILLES PISIER
97 Topics in cyclic theory, DANIEL G. QUILLEN & GORDON BLOWER
98 Fast track to forcing, MIRNA DŽAMONJA
99 A gentle introduction to homological mirror symmetry, RAF BOCKLANDT
100 The calculus of braids, PATRICK DEHORNOY
101 Classical and discrete functional analysis with measure theory, MARTIN BUNTINAS
102 Notes on Hamiltonian dynamical systems, ANTONIO GIORGILLI
Introduction
Stochastic games are a mathematical model that is used to study dynamic interactions among agents who influence the evolution of the environment. These games were first presented and studied by Lloyd Shapley (1953).¹,² Since Shapley's seminal work, the literature on stochastic games has expanded considerably, and the model has been applied to numerous areas, such as arms races, fishery wars, and taxation.

A stochastic game is played in discrete time by a finite set $I$ of players, and it consists of a finite number of states. In each state $s$, each player $i \in I$ has a given set of actions, denoted $A^i(s)$. In every stage $t \in \mathbb{N}$, the play is in one of the states, denoted $s_t$. Each player $i \in I$ chooses an action $a^i_t \in A^i(s_t)$ that is available to her at the current stage and receives a stage payoff, which depends on the current state $s_t$ as well as on the actions $(a^j_t)_{j \in I}$ chosen by the players. A new state $s_{t+1}$ is then chosen according to a probability distribution that depends on the current state and on the actions $(a^j_t)_{j \in I}$ of the players.

In a stochastic game, the players have two seemingly contradictory goals. First, they need to ensure that their future opportunities remain high. At the same time, they should make sure that their stage payoff is also high. This dichotomy makes the analysis of stochastic games intriguing and nontrivial.

The study of stochastic games uses tools from many mathematical branches, such as probability, analysis, algebra, differential equations, and combinatorics. The goal of this book is to present the theory through the mathematical techniques that it employs. Thus, each chapter presents mathematical results from some branch of mathematics, and uses them to prove results on stochastic games. The goal is not to prove the most general theorems in stochastic games, but rather to present the beauty of the theory. Accordingly, we sometimes restrict the scope of the results that are proven, to allow for simpler proofs that bypass technical difficulties.

1 Lloyd Stowell Shapley (Cambridge, Massachusetts, June 2, 1923 – Tucson, Arizona, March 12, 2016) was an American mathematician who made many influential contributions to Game Theory, like the Shapley value, stochastic games, and the deferred-acceptance algorithm for stable marriages. Shapley shared the 2012 Nobel Prize in Economics with game theorist Alvin Roth.
2 All commentary is taken from Wikipedia.
The material in this book is summarized by the following table:

Chapter | Tool | Result
1 | Contracting mappings | Stationary optimal strategies in Markov decision problems
2 | Tauberian Theorem | Uniform ε-optimality in hidden Markov decision problems
5 | Contracting mappings | Stationary discounted optimal strategies in zero-sum stochastic games
6 | Semi-algebraic mappings | Existence of the limit of the discounted value
7 | B-graphs | Continuity of the limit of the discounted value
8 | Kakutani's fixed point theorem | Stationary discounted equilibria in multiplayer stochastic games
9 | | Existence of the uniform value in zero-sum stochastic games
10 | The vanishing discount factor approach | Existence of uniform equilibrium in absorbing games
11 | Ramsey's Theorem | Existence of undiscounted equilibrium in two-player deterministic stopping games
12 | Approximating infinite orbits | Existence of undiscounted equilibrium in multiplayer quitting games
13 | Linear complementarity problems | Existence of undiscounted equilibrium in multiplayer quitting games
Each chapter contains exercises. Solutions are available as supplementary material on the book's page on the publisher's website. The book is based on a graduate-level course that I taught at Tel Aviv University for more than a decade. I hope that the readers, as my students, will like the diversity of the topics and the elegance of the proofs. For the benefit of readers who would like to expand their knowledge in stochastic games, I added references to related results at the end of each chapter. Books and surveys that include material on different aspects of stochastic games include Raghavan et al. (1991), Raghavan and Filar (1991), Filar and Vrieze (1997), Başar and Olsder (1998), Mertens (2002), Vieille (2002), Neyman and Sorin (2003), Solan (2008), Chatterjee et al. (2009, 2013), Chatterjee and Henzinger (2012), Laraki and Sorin (2015), Mertens et al. (2015), Solan and Vieille (2015), Solan and Ziliotto (2016), Başar and Zaccour (2017), Jaśkiewicz and Nowak (2018a,b), and Renault (2019).

I end the introduction by thanking Ayala Mashiah-Yaakovi, who read the manuscript and the solution manual and made many comments that improved the text; Andrei Iacob, who copyedited the text; and John Yehuda Levy, Andrzej Nowak, Robert Simon, Bernhard von Stengel, Uri Zwick, and my students throughout the years for providing comments and spotting typos.
Notation
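The set of positive integers is
\[ \mathbb{N} := \{1, 2, 3, \ldots\}. \]
The number of elements in a finite set $K$ is denoted by $|K|$. For every finite set $K$, the set of probability distributions over $K$ is denoted by $\Delta(K)$. We identify each element $k \in K$ with the probability distribution in $\Delta(K)$ that assigns probability 1 to $k$. For a probability distribution $\mu \in \Delta(K)$, the support of $\mu$, denoted $\mathrm{supp}(\mu)$, is the set of all elements $k \in K$ that have positive probability under $\mu$:
\[ \mathrm{supp}(\mu) := \{k \in K : \mu[k] > 0\}. \]
A probability distribution is pure if $\mathrm{supp}(\mu)$ contains only one element: $|\mathrm{supp}(\mu)| = 1$.

Let $I$ be a finite set, and, for each $i \in I$, let $A^i$ be a set. We denote by $A^I := \prod_{i \in I} A^i$ the cartesian product, and denote $A^{-i} := \prod_{j \in I \setminus \{i\}} A^j$. Similarly, if $a = (a^i)_{i \in I} \in A^I$, we denote by $a^{-i} := (a^j)_{j \in I \setminus \{i\}} \in A^{-i}$ the vector $a$ with its $i$'th coordinate removed.

We will use two norms, the $L_1$-norm and the $L_\infty$-norm (or the maximum norm). For a vector $x \in \mathbb{R}^n$, we define
\[ \|x\|_1 := \sum_{i=1}^n |x_i|, \qquad \|x\|_\infty := \max_{1 \leq i \leq n} |x_i|. \]
For a function $f : X \to \mathbb{R}$, $\mathrm{argmax}_{x \in X} f(x)$ is the set of all points in $X$ that maximize $f$:
\[ \mathrm{argmax}_{x \in X} f(x) := \Big\{ y \in X : f(y) = \max_{x \in X} f(x) \Big\}. \]
When the set $X$ is compact and the function $f$ is continuous, the set $\mathrm{argmax}_{x \in X} f(x)$ is non-empty.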
Markov Decision Problems
In this chapter, we introduce Markov decision problems, which are stochastic games with a single player. They serve as an appetizer. On the one hand, the basic concepts and basic proofs for zero-sum stochastic games are better understood in this simple model. On the other hand, some of the conclusions that we draw for Markov decision problems are different from those drawn for zero-sum stochastic games. This illustrates the inherent difference between single-player decision problems and multiplayer decision problems (= games). The interested reader is referred to, for example, Ross (1982) or Puterman (1994) for an exposition of Markov decision problems.

We will study both the $T$-stage evaluation and the discounted evaluation. We will introduce and study contracting mappings,¹ and will use such mappings to show that the decision maker has a stationary discounted optimal strategy. We will also define the concept of uniform optimality, and show that the decision maker has a stationary uniformly optimal strategy.

Definition 1.1 A Markov decision problem² is a vector $\Gamma = \langle S, (A(s))_{s \in S}, q, r \rangle$ where
• $S$ is a finite set of states.
• For each $s \in S$, $A(s)$ is a finite set of actions available at state $s$. The set of pairs (state, action) is denoted by
\[ SA := \{(s, a) : s \in S, a \in A(s)\}. \]
• $q : SA \to \Delta(S)$ is a transition rule.
• $r : SA \to \mathbb{R}$ is a payoff function.

1 We adhere to the convention that a mapping is a function whose range is a general space or $\mathbb{R}^n$, while a function is always real-valued.
2 Andrey Andreyevich Markov (Ryazan, Russia, June 14, 1856 – St. Petersburg, Russia, July 20, 1922) was a Russian mathematician. He is best known for his work on the theory of stochastic processes that now bear his name: Markov chains and Markov processes.

A Markov decision problem involves a decision maker, and it evolves as follows. The problem lasts for infinitely many stages. The initial state $s_1 \in S$ is given. At each stage $t \geq 1$, the following happens:
• The current state $s_t$ is announced to the decision maker.
• The decision maker chooses an action $a_t \in A(s_t)$ and receives the stage payoff $r(s_t, a_t)$.
• A new state $s_{t+1}$ is drawn according to $q(\cdot \mid s_t, a_t)$, and the game proceeds to stage $t + 1$.
Example 1.2 Consider the following situation. The technological level of a country can be High (H), Medium (M), or Low (L). The annual investment of the country in technological advances can also be high (2 billion dollars), medium (1 billion dollars), or low (0.5 billion dollars). The annual gain from the technological level is increasing: the high, medium, and low technological levels yield 10, 6, and 2 billion dollars, respectively. The technological level changes stochastically as a function of the investment in technological advancement, according to the following table:³

Technology level | High investment | Medium investment | Low investment
High | [1(H)] | [1/2(H), 1/2(M)] | [1/4(H), 3/4(M)]
Medium | [3/5(H), 2/5(M)] | [1(M)] | [2/5(M), 3/5(L)]
Low | [3/5(M), 2/5(L)] | [2/5(M), 3/5(L)] | [1(L)]

3 Here and in the sequel, a probability distribution is denoted by a list of probabilities and outcomes in square brackets, where the outcomes are written within round brackets. Thus, [2/3(H), 1/3(M)] means a probability distribution that assigns probability 2/3 to H and probability 1/3 to M.

The situation can be presented as a Markov decision problem as follows:
• There are three states, which represent the three technological levels: $S = \{H, M, L\}$.
• There are three actions in each state, which represent the three investment levels: $A(s) = \{h, m, l\}$ for each $s \in S$.
• The transition rule is given by
\[
\begin{array}{lll}
q(H \mid H,h) = 1, & q(M \mid H,h) = 0, & q(L \mid H,h) = 0,\\
q(H \mid H,m) = \tfrac12, & q(M \mid H,m) = \tfrac12, & q(L \mid H,m) = 0,\\
q(H \mid H,l) = \tfrac14, & q(M \mid H,l) = \tfrac34, & q(L \mid H,l) = 0,\\
q(H \mid M,h) = \tfrac35, & q(M \mid M,h) = \tfrac25, & q(L \mid M,h) = 0,\\
q(H \mid M,m) = 0, & q(M \mid M,m) = 1, & q(L \mid M,m) = 0,\\
q(H \mid M,l) = 0, & q(M \mid M,l) = \tfrac25, & q(L \mid M,l) = \tfrac35,\\
q(H \mid L,h) = 0, & q(M \mid L,h) = \tfrac35, & q(L \mid L,h) = \tfrac25,\\
q(H \mid L,m) = 0, & q(M \mid L,m) = \tfrac25, & q(L \mid L,m) = \tfrac35,\\
q(H \mid L,l) = 0, & q(M \mid L,l) = 0, & q(L \mid L,l) = 1.
\end{array}
\]
• The payoff function (in billions of dollars) is the annual gain minus the annual investment:
\[
\begin{array}{lll}
r(H,h) = 8, & r(H,m) = 9, & r(H,l) = 9\tfrac12,\\
r(M,h) = 4, & r(M,m) = 5, & r(M,l) = 5\tfrac12,\\
r(L,h) = 0, & r(L,m) = 1, & r(L,l) = 1\tfrac12.
\end{array}
\]
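The data of Example 1.2 can be written down directly as a small data structure. The following is a minimal sketch, not part of the original text: the names `STATES`, `ACTIONS`, `GAIN`, `COST`, `Q`, `R`, and `is_valid_transition_rule` are illustrative assumptions, and the stage payoff is computed as gain minus investment, which is how the listed payoffs appear to be obtained.

```python
from fractions import Fraction as F

# States and actions of Example 1.2 (labels follow the text).
STATES = ["H", "M", "L"]
ACTIONS = ["h", "m", "l"]

# Annual gain per technological level and annual cost per investment level,
# in billions of dollars, as stated in the example.
GAIN = {"H": F(10), "M": F(6), "L": F(2)}
COST = {"h": F(2), "m": F(1), "l": F(1, 2)}

# Transition rule q(. | s, a); zero-probability outcomes are omitted.
Q = {
    ("H", "h"): {"H": F(1)},
    ("H", "m"): {"H": F(1, 2), "M": F(1, 2)},
    ("H", "l"): {"H": F(1, 4), "M": F(3, 4)},
    ("M", "h"): {"H": F(3, 5), "M": F(2, 5)},
    ("M", "m"): {"M": F(1)},
    ("M", "l"): {"M": F(2, 5), "L": F(3, 5)},
    ("L", "h"): {"M": F(3, 5), "L": F(2, 5)},
    ("L", "m"): {"M": F(2, 5), "L": F(3, 5)},
    ("L", "l"): {"L": F(1)},
}

# Stage payoff (assumption: gain of the current level minus the investment).
R = {(s, a): GAIN[s] - COST[a] for s in STATES for a in ACTIONS}


def is_valid_transition_rule(q):
    """Check that every q(. | s, a) is a probability distribution over STATES."""
    return all(
        set(dist) <= set(STATES)
        and all(p >= 0 for p in dist.values())
        and sum(dist.values()) == 1
        for dist in q.values()
    )
```

Exact fractions are used so that the check that each row sums to 1 is not affected by floating-point rounding.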
Example 1.3 The Markov decision problem that is illustrated in Figure 1.1 is formally defined as follows:
• There are three states: $S = \{s(1), s(2), s(3)\}$.
• In state $s(1)$, there are two actions: $A(s(1)) = \{U, D\}$; in states $s(2)$ and $s(3)$, there is one action: $A(s(2)) = A(s(3)) = \{D\}$.
• Payoffs appear at the center of each entry and are given by:
\[ r(s(1),U) = 10; \quad r(s(1),D) = 5; \quad r(s(2),D) = 10; \quad r(s(3),D) = -100. \]
• Transitions appear in parentheses next to the payoff and are given by:
– If in state $s(1)$ the decision maker chooses $U$, the process moves to state $s(2)$, that is, $q(s(2) \mid s(1),U) = 1$.
– If in state $s(1)$ the decision maker chooses $D$, the process remains in state $s(1)$, that is, $q(s(1) \mid s(1),D) = 1$.
– From state $s(2)$, the process moves to state $s(1)$ with probability $\tfrac{1}{10}$ and to state $s(3)$ with probability $\tfrac{9}{10}$, that is, $q(s(1) \mid s(2),D) = \tfrac{1}{10}$ and $q(s(3) \mid s(2),D) = \tfrac{9}{10}$.
– Once the process reaches state $s(3)$, it stays there, that is, $q(s(3) \mid s(3),D) = 1$.

[Figure 1.1 The Markov decision problem in Example 1.3. Each entry shows the payoff followed by the transition probabilities to $(s(1), s(2), s(3))$. State $s(1)$: $U$: 10 (0, 1, 0); $D$: 5 (1, 0, 0). State $s(2)$: $D$: 10 (1/10, 0, 9/10). State $s(3)$: $D$: −100 (0, 0, 1).]
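1.1 On Histories
For $t \in \mathbb{N}$, the set of histories of length $t$ is defined by
\[ H_t := (SA)^{t-1} \times S, \]
where by convention $(SA)^0 = \emptyset$. This is the set of all histories that may occur until stage $t$. A typical element in $H_t$ is denoted by $h_t$. The last state of history $h_t$ is denoted by $s_t$. The set $H_1$ is identified with the state space $S$, and the history $(s_1)$ is simply denoted by $s_1$. We denote the set of all histories by
\[ H := \bigcup_{t \in \mathbb{N}} H_t, \]
and the set of all infinite histories or plays by
\[ H_\infty := (SA)^{\mathbb{N}}. \]
The set of plays $H_\infty$ is a measurable space, with the sigma-algebra generated by the cylinder sets, which are defined as follows. For a history $\widehat{h}_t = (\widehat{s}_1, \widehat{a}_1, \ldots, \widehat{s}_t) \in H_t$, the cylinder set $C(\widehat{h}_t) \subset H_\infty$ is the collection of all plays that start with $\widehat{h}_t$, that is,
\[ C(\widehat{h}_t) := \{h = (s_1, a_1, s_2, a_2, \ldots) \in H_\infty : s_1 = \widehat{s}_1, a_1 = \widehat{a}_1, \ldots, s_t = \widehat{s}_t\}. \]
For every $t \in \mathbb{N}$, the collection of all cylinder sets $(C(h_t))_{h_t \in H_t}$ defines a finite partition, or an algebra, on $H_\infty$. We denote by $\mathcal{H}_t$ this algebra and by $\mathcal{H}$ the sigma-algebra on $H_\infty$ generated by the algebras $(\mathcal{H}_t)_{t \in \mathbb{N}}$.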
1.2 On Strategies
A mixed action at state $s$ is a probability distribution over the set of actions $A(s)$ available at state $s$. The set of mixed actions at state $s$ is therefore $\Delta(A(s))$. A strategy of the decision maker specifies how the decision maker should play after each possible history.

Definition 1.4 A strategy is a mapping $\sigma$ that assigns to each history $h = (s_1, a_1, \ldots, a_{t-1}, s_t)$ a mixed action in $\Delta(A(s_t))$.

The set of all strategies is denoted by $\Sigma$.

A decision maker who follows a strategy $\sigma$ behaves as follows: at each stage $t$, given the past history $(s_1, a_1, \ldots, s_t)$, the decision maker chooses an action $a_t$ according to the mixed action $\sigma(\cdot \mid s_1, a_1, \ldots, s_t)$.

Comment 1.5 A strategy as defined in Definition 1.4 is termed in the literature a behavior strategy.

Comment 1.6 The fact that the choice of the decision maker depends on past play implicitly assumes that the decision maker knows the past play; that is, the decision maker observes (and remembers) all past states that the process visited, and she remembers all her past choices. In Chapter 2, we will study the model of Markov decision problems when the decision maker does not observe the state.

Comment 1.7 A strategy contains a lot of irrelevant information. Indeed, when the initial state is $s_1 = s$, it is not important what the decision maker would play if the initial state were $s' \neq s$. Similarly, if in the first stage the decision maker played the action $a_1 = a$, it is irrelevant what she would play in the second stage if she played the action $a' \neq a$ in the first stage. We nevertheless regard a strategy as a mapping defined on the set of all histories, because of the simplicity of the definition; otherwise we would have to define for every strategy $\sigma$ and every positive integer $t$ the set of all histories of length $t$ that can occur with positive probability when the decision maker follows strategy $\sigma$ (which depends on the definition of $\sigma$ up to stage $t - 1$), and define $\sigma$ at stage $t$ only for those histories.
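Every strategy $\sigma$, together with the initial state $s_1$, defines a probability distribution $\mathbf{P}_{s_1,\sigma}$ on the measurable space $(H_\infty, \mathcal{H})$. To define this probability distribution formally, we define it on the collection of cylinder sets that generate $(H_\infty, \mathcal{H})$ by the rule
\[ \mathbf{P}_{s_1,\sigma}\big(C(\widehat{s}_1, \widehat{a}_1, \ldots, \widehat{s}_t)\big) := \mathbf{1}_{\{\widehat{s}_1 = s_1\}} \prod_{k=1}^{t-1} \sigma(\widehat{a}_k \mid \widehat{s}_1, \widehat{a}_1, \ldots, \widehat{s}_k) \times \prod_{k=1}^{t-1} q(\widehat{s}_{k+1} \mid \widehat{s}_k, \widehat{a}_k). \]
Let $\mathbf{P}_{s_1,\sigma}$ be the unique probability distribution on $H_\infty$ that agrees with this definition on cylinder sets. The fact that, in this way, we indeed obtain a unique probability distribution is guaranteed by the Carathéodory⁴ Extension Theorem (see, e.g., theorem 3.1 in Billingsley (1995)).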
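The product rule for cylinder sets can be evaluated mechanically for any finite history. The sketch below is illustrative only: it assumes the rule multiplies, stage by stage, the probability that the strategy selects the recorded action by the probability that the transition rule selects the recorded next state. The function name `cylinder_probability`, the dictionary encoding of Example 1.3, and the particular mixed strategy `sigma` are assumptions made for the illustration.

```python
from fractions import Fraction as F

# Transition rule of Example 1.3; zero-probability outcomes are omitted.
Q = {
    ("s1", "U"): {"s2": F(1)},
    ("s1", "D"): {"s1": F(1)},
    ("s2", "D"): {"s1": F(1, 10), "s3": F(9, 10)},
    ("s3", "D"): {"s3": F(1)},
}


def sigma(prefix):
    """Hypothetical strategy: mix U and D equally at s1, play D elsewhere.

    `prefix` is a history [s_1, a_1, ..., s_k]; only its last state is used.
    """
    if prefix[-1] == "s1":
        return {"U": F(1, 2), "D": F(1, 2)}
    return {"D": F(1)}


def cylinder_probability(history, s1, strategy, q):
    """Probability that the play starts with history = [s_1, a_1, ..., s_t]."""
    if history[0] != s1:
        return F(0)
    prob = F(1)
    # Walk over the triples (prefix ending in s_k, a_k, s_{k+1}).
    for k in range(0, len(history) - 1, 2):
        prefix, a, s_next = history[: k + 1], history[k + 1], history[k + 2]
        prob *= strategy(prefix).get(a, F(0))
        prob *= q[(prefix[-1], a)].get(s_next, F(0))
    return prob


p = cylinder_probability(["s1", "U", "s2", "D", "s1"], "s1", sigma, Q)
```

With these choices, `p` is the product of 1/2 (choosing U), 1 (moving to s(2)), 1 (choosing D), and 1/10 (returning to s(1)).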
4 Constantin Carathéodory (Berlin, Germany, September 13, 1873 – Munich, Germany, February 2, 1950) was a Greek mathematician who spent most of his career in Germany. He made significant contributions to the theory of functions of a real variable, the calculus of variations, and measure theory. His work also includes important results in conformal representations and in the theory of boundary correspondence.
Two simple classes of strategies are pure strategies, which involve no randomization, and stationary strategies, which depend only on the current state and not on the whole past history.

Definition 1.8 A strategy $\sigma$ is pure if $|\mathrm{supp}(\sigma(h_t))| = 1$ for every history $h_t \in H$.

The set of pure strategies is denoted by $\Sigma_P$.

Definition 1.9 A strategy $\sigma$ is stationary if, for every two histories $h_t = (s_1, a_1, s_2, \ldots, a_{t-1}, s_t)$ and $\widehat{h}_k = (\widehat{s}_1, \widehat{a}_1, \widehat{s}_2, \ldots, \widehat{a}_{k-1}, \widehat{s}_k)$ that satisfy $s_t = \widehat{s}_k$, we have $\sigma(h_t) = \sigma(\widehat{h}_k)$.

The set of stationary strategies is denoted by $\Sigma_S$.

A pure stationary strategy assigns to each state $s \in S$ an action in $A(s)$. Since the number of actions in $A(s)$ is $|A(s)|$, we can express the number of pure stationary strategies in terms of the data of the Markov decision problem.

Theorem 1.10 The number of pure stationary strategies is $\prod_{s \in S} |A(s)|$.
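The count in Theorem 1.10 can be checked by enumerating all maps from states to available actions. This is a small illustrative sketch, using the action sets of Example 1.3; the function name `pure_stationary_strategies` and the state labels are assumptions made for the example.

```python
from itertools import product
from math import prod


def pure_stationary_strategies(actions):
    """Yield every pure stationary strategy as a dict state -> action.

    `actions` maps each state to the list of actions available there.
    """
    states = sorted(actions)
    for choice in product(*(actions[s] for s in states)):
        yield dict(zip(states, choice))


# Action sets of Example 1.3: two actions at s(1), one at s(2) and s(3).
A = {"s1": ["U", "D"], "s2": ["D"], "s3": ["D"]}
strategies = list(pure_stationary_strategies(A))
expected_count = prod(len(a) for a in A.values())
```

Here the enumeration produces one strategy per element of the cartesian product of the action sets, which is exactly the product of their sizes.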
One can identify a stationary strategy $\sigma$ with a vector $x \in \prod_{s \in S} \Delta(A(s))$. With this identification, $x(s)$ is the mixed action chosen when the current state is $s$. Thus, the set of stationary strategies $\Sigma_S$ can be identified with the space $X := \prod_{s \in S} \Delta(A(s))$, which is convex and compact. For every element $x \in X$, the stationary strategy that corresponds to $x$ is still denoted by $x$.

In Definition 1.4 we defined a strategy to be a mapping from histories to mixed actions. We now present another concept of a strategy that involves randomization: a mixed strategy.

Definition 1.11 A mixed strategy is a probability distribution over the set $\Sigma_P$ of pure strategies.

Every strategy is equivalent to a mixed strategy. Indeed, a strategy $\sigma$ is defined by $\aleph_0$ lotteries: to each history $h_t \in H$, it assigns a lottery $\sigma(h_t) \in \Delta(A(s_t))$. If the decision maker performs all the $\aleph_0$ lotteries before the play starts, then the realizations of the lotteries define a pure strategy. In particular, the strategy defines a probability distribution over the set of pure strategies.

Conversely, every mixed strategy is equivalent to a strategy. Indeed, given a mixed strategy $\tau$, one can calculate for each history $h_t$ the conditional probability $\sigma(a_t \mid h_t)$ that the action chosen after $h_t$ is $a_t \in A(s_t)$. If the history $h_t$ occurs with probability 0 under $\mathbf{P}_{s_1,\sigma}$, we set $\sigma(a_t \mid h_t)$ arbitrarily. One can show that the strategy $\sigma$ is equivalent to the mixed strategy $\tau$.

The equivalence just described is a special case of a more general result called Kuhn's Theorem;⁵ see, for example, Maschler, Solan, and Zamir (2020, chapter 7).
1.3 The $T$-Stage Payoff
The decision maker receives the stage payoff $r(s_t, a_t)$ at every stage $t$. How does she compare sequences of stage payoffs? We will study two methods of evaluation. The first, which we consider in this section, is the $T$-stage evaluation. This evaluation is relevant when the process lasts $T$ stages, and the goal of the decision maker is to maximize her expected average payoff during these stages. The second, which we will study in the next section, is the discounted evaluation, which is relevant when the play continues indefinitely, and the goal of the decision maker is to maximize the expected discounted sum of her stage payoffs.

The expectation operator for the probability distribution $\mathbf{P}_{s_1,\sigma}$ is denoted by $\mathbf{E}_{s_1,\sigma}[\,\cdot\,]$. In particular, $\mathbf{E}_{s_1,\sigma}[r(s_t, a_t)]$ is the expected payoff at stage $t$.

Definition 1.12 For every positive integer $T \in \mathbb{N}$, every initial state $s_1 \in S$, and every strategy $\sigma \in \Sigma$, define the $T$-stage payoff by:
\[ \gamma_T(s_1; \sigma) := \mathbf{E}_{s_1,\sigma}\left[ \frac{1}{T} \sum_{t=1}^{T} r(s_t, a_t) \right]. \]

Example 1.13 The Markov decision problem in this example is given in Figure 1.2. The initial state is $s(1)$. We will calculate the $T$-stage payoff of every pure strategy.

[Figure 1.2 The Markov decision problem in Example 1.13]

5 Harold William Kuhn (Santa Monica, California, July 29, 1925 – New York City, New York, July 2, 2014) was an American mathematician. He is known for the Karush–Kuhn–Tucker conditions, for Kuhn's theorem, and for developing Kuhn poker as well as the description of the Hungarian method for the assignment problem.
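The strategy $\sigma_D$ that always plays $D$ yields a payoff 5 at every stage, and therefore its $T$-stage payoff is 5 as well:
\[ \gamma_T(s(1); \sigma_D) = 5, \quad \forall T \in \mathbb{N}. \]
The strategy $\sigma_U$ that plays $U$ in the first stage yields 10 in the first stage and 2 in all subsequent stages. Therefore,
\[ \gamma_T(s(1); \sigma_U) = \frac{10 + 2(T-1)}{T} = \frac{2T + 8}{T}. \]
For every $0 \leq t < T$, the strategy $\sigma_{D^t U}$ that plays $D$ in the first $t$ stages and $U$ in stage $t + 1$ yields 5 in the first $t$ stages, 10 in stage $t + 1$, and 2 in all subsequent stages. Therefore,
\[ \gamma_T(s(1); \sigma_{D^t U}) = \frac{5t + 10 + 2(T - t - 1)}{T} = \frac{2T + 3t + 8}{T}. \]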
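The arithmetic behind these payoffs is easy to check numerically. The following sketch only uses the stage payoffs described in the text (5 for each play of $D$, 10 for the single play of $U$, and 2 afterwards); the function name `payoff_D_t_then_U` is a hypothetical helper introduced for the illustration.

```python
from fractions import Fraction


def payoff_D_t_then_U(t, T):
    """T-stage average payoff of playing D for t stages, then U, in Example 1.13.

    Stage payoffs as described in the text: 5 per D, 10 for U, 2 afterwards.
    """
    if not 0 <= t < T:
        raise ValueError("need 0 <= t < T")
    total = 5 * t + 10 + 2 * (T - t - 1)
    return Fraction(total, T)


T = 4
payoffs = [payoff_D_t_then_U(t, T) for t in range(T)]
best_t = max(range(T), key=lambda t: payoffs[t])
```

Since the payoff is increasing in $t$, the largest value among these strategies is obtained by delaying $U$ to the last stage.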
Definition 1.14 Let $s \in S$ and let $T \in \mathbb{N}$. The real number $v_T(s)$ is the $T$-stage value at the initial state $s$ if
\[ v_T(s) := \sup_{\sigma \in \Sigma} \gamma_T(s; \sigma). \tag{1.3} \]
Any strategy in $\mathrm{argmax}_{\sigma \in \Sigma} \gamma_T(s; \sigma)$ is $T$-stage optimal at $s$.

In other words, the $T$-stage value at $s$ is the maximal amount that the decision maker can get when the initial state is $s$, and a strategy that guarantees this quantity is $T$-stage optimal.

Is the supremum in Eq. (1.3) attained? That is, is there a $T$-stage optimal strategy? As Theorem 1.15 states, the answer is positive.

Theorem 1.15 For every $s \in S$ and every $T \geq 1$, there is a $T$-stage optimal strategy at the initial state $s$.

Proof In the $T$-stage game, the only relevant part of the strategy is its play up to stage $T$. In particular, for the purpose of studying the $T$-stage problem, we can define a strategy as a mapping $\sigma : \bigcup_{t=1}^{T} H_t \to \bigcup_{s \in S} \Delta(A(s))$, such that $\sigma(h_t) \in \Delta(A(s_t))$, for every history $h_t \in \bigcup_{t=1}^{T} H_t$. The set of such mappings is a compact subset of a Euclidean space. The payoff function is continuous on this set. Since a continuous function defined on a compact set attains its maximum, the result follows.
Comment 1.16 We can strengthen Theorem 1.15 and prove that, for every $s \in S$ and every $T \geq 1$, there is a $T$-stage pure optimal strategy at the initial state $s$ (see Theorem 1.18). To see this, consider the function that maps each mixed strategy $\sigma$ into the $T$-stage payoff $\gamma_T(s; \sigma)$. This function is linear. Indeed, let $\sigma_1$ and $\sigma_2$ be two strategies, and let $\sigma_3$ be the following strategy: toss a fair coin; if the result is Head, follow $\sigma_1$, whereas if it is Tail, follow $\sigma_2$. Then
\[ \gamma_T(s; \sigma_3) = \tfrac12 \gamma_T(s; \sigma_1) + \tfrac12 \gamma_T(s; \sigma_2). \]
By the Krein–Milman⁶ Theorem, a linear function that is defined on a compact space attains its maximum at an extreme point. Since the pure strategies are the extreme points of the set of mixed strategies, it follows that the function $\sigma \mapsto \gamma_T(s; \sigma)$ attains its maximum at a pure strategy.

Example 1.13, continued The quantity $\gamma_T(s(1); \sigma_{D^t U}) = \frac{2T + 3t + 8}{T}$ is maximized when $t = T - 1$: the decision maker plays $D$ for $T - 1$ stages, and then she plays $U$ once. The resulting average payoff is $5 + \frac{5}{T}$. The $T$-stage value at the initial state $s(1)$ is therefore $v_T(s(1)) = 5 + \frac{5}{T}$.
In general, the $T$-stage value, as well as the $T$-stage optimal strategies, can be found by backward induction, a method that is also known as the dynamic programming principle. We now formalize this method.

Theorem 1.17 For every initial state $s_1 \in S$ and every $T \geq 2$, we have
\[ v_T(s_1) = \max_{a_1 \in A(s_1)} \left( \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1)\, v_{T-1}(s_2) \right). \tag{1.4} \]

Eq. (1.4) states that, to calculate the $T$-stage value, we can break the problem into two parts: the first stage, and the last $T - 1$ stages. Since transitions and payoffs depend only on the current state and on the current action, the problem that starts at stage 2 is not affected by $s_1$ and $a_1$, the state and action at stage 1. This problem is a $(T-1)$-stage Markov decision problem, whose value $v_{T-1}(s_2)$ depends on its initial state (and not on the initial state $s_1$). To calculate the $T$-stage value, we collapse the last $T - 1$ stages into a single number, the value of the $(T-1)$-stage problem that starts at stage 2, and we ask what is the optimal action in the first stage, assuming that if state $s_2$ is reached at stage 2, the continuation value is $v_{T-1}(s_2)$.

In Eq. (1.4) the weight of the payoff in the first stage, $r(s_1, a_1)$, is $\frac{1}{T}$, and the weight of the value of the $(T-1)$-stage problem that encapsulates the last $T - 1$ stages is $\frac{T-1}{T}$. Why do we take these weights? The reason is that the quantity $r(s_1, a_1)$ represents the payoff in the first stage, while the quantity $v_{T-1}(s_2)$ captures the average payoff in $T - 1$ stages: stages $2, 3, \ldots, T$. The weights of each of the two quantities reflect this point.

6 Mark Grigorievich Krein (Kiev, Russia, April 3, 1907 – Odessa, Ukraine, October 17, 1989) was a Soviet mathematician who is best known for his work in operator theory.
David Pinhusovich Milman (Kiev, Russia, January 15, 1912 – Tel Aviv, Israel, July 12, 1982) was a Soviet and later Israeli mathematician specializing in functional analysis.
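To prove Theorem 1.17, we will consider conditional expectation. Recall that $\mathbf{E}_{s_1,\sigma}[r(s_t, a_t)]$ is the expected payoff at stage $t$. For every $t' \leq t$ and every history $\widehat{h}_{t'} = (\widehat{s}_1, \widehat{a}_1, \ldots, \widehat{s}_{t'}) \in H_{t'}$ with $\widehat{s}_1 = s_1$, the quantity $\mathbf{E}_{s_1,\sigma}[r(s_t, a_t) \mid \widehat{h}_{t'}]$ is the expected payoff at stage $t$, conditional that the history $\widehat{h}_{t'}$ has occurred, that is, conditional that the action in the initial state is $\widehat{a}_1$, the state at stage 2 is $\widehat{s}_2$, and so on. Formally, for every history $\widehat{h}_{t'} = (\widehat{s}_1, \widehat{a}_1, \ldots, \widehat{s}_{t'}) \in H_{t'}$, the probability distribution $\mathbf{P}_{s_1,\sigma}(\cdot \mid \widehat{h}_{t'})$ is defined as follows:
• For histories that are not longer than $\widehat{h}_{t'}$: For every $t \leq t'$ we have
\[ \mathbf{P}_{s_1,\sigma}(C(s_1, a_1, \ldots, s_t) \mid \widehat{h}_{t'}) := \mathbf{1}_{\{s_1 = \widehat{s}_1, a_1 = \widehat{a}_1, \ldots, s_t = \widehat{s}_t\}}. \]
• For histories that are longer than $\widehat{h}_{t'}$: For every $t > t'$, we have
\[ \mathbf{P}_{s_1,\sigma}(C(s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t) \mid \widehat{h}_{t'}) := \mathbf{1}_{\{s_1 = \widehat{s}_1, a_1 = \widehat{a}_1, \ldots, s_{t'} = \widehat{s}_{t'}\}} \prod_{k=t'}^{t-1} \sigma(a_k \mid s_1, a_1, \ldots, s_k) \times \prod_{k=t'}^{t-1} q(s_{k+1} \mid s_k, a_k). \]
Denote by $\mathbf{E}_{s_1,\sigma}[\,\cdot \mid \widehat{h}_{t'}]$ the expectation with respect to $\mathbf{P}_{s_1,\sigma}(\cdot \mid \widehat{h}_{t'})$.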
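Proof of Theorem 1.17 For $T = 1$, the $T$-stage problem concerns the first stage only, and
\[ v_1(s_1) = \max_{a_1 \in A(s_1)} r(s_1, a_1). \]
In particular, Eq. (1.4) holds. For $T \geq 2$, by definition and by the law of iterated expectations,
\[ v_T(s_1) = \max_{\sigma \in \Sigma} \mathbf{E}_{s_1,\sigma}\left[ \frac{1}{T} \sum_{t=1}^{T} r(s_t, a_t) \right] = \max_{\sigma \in \Sigma} \mathbf{E}_{s_1,\sigma}\left[ \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T}\, \mathbf{E}_{s_1,\sigma}\left[ \frac{1}{T-1} \sum_{t=2}^{T} r(s_t, a_t) \,\Big|\, s_1, a_1, s_2 \right] \right]. \tag{1.5} \]
The term within the maximization in these equalities depends only on the part of the strategy $\sigma$ that follows the initial state $s_1$. This part is composed of the mixed action $\sigma(s_1) \in \Delta(A(s_1))$ that is played in the first stage and the continuation strategies played from the second stage and on. We denote these continuation strategies by $(\sigma_{s_1,a_1})_{a_1 \in A(s_1)}$. Formally, for every action $a_1 \in A(s_1)$, $\sigma_{s_1,a_1}$ is a strategy in the $(T-1)$-stage problem that is defined by
\[ \sigma_{s_1,a_1}(h_t) := \sigma(s_1, a_1, h_t), \quad \forall h_t \in \bigcup_{t=1}^{T-1} H_t. \]
With this notation, the right-hand side in Eq. (1.5) is equal to
\[ \max_{\alpha \in \Delta(A(s_1))}\ \max_{(\sigma_{s_1,a_1})_{a_1 \in A(s_1)}}\ \sum_{a_1 \in A(s_1)} \alpha(a_1) \left( \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1)\, \gamma_{T-1}(s_2; \sigma_{s_1,a_1}) \right), \tag{1.6} \]
where $\alpha$ captures the mixed action played in the first stage. The continuation strategies $(\sigma_{s_1,a_1})_{a_1 \in A(s_1)}$ do not affect the payoff in the first stage $r(s_1, a_1)$. The action $a_1$ that is chosen in the first stage affects the continuation payoff in two ways. First, it determines the probability $q(s_2 \mid s_1, a_1)$ that the state in the second stage is $s_2$. Second, it determines the continuation strategy $\sigma_{s_1,a_1}$. Since the probability distribution $\mathbf{P}_{s_1,\sigma}$ conditional on $a_1$ and $s_2$ is equal to the probability distribution $\mathbf{P}_{s_2,\sigma_{s_1,a_1}}$, it follows that we can split the maximization problem in Eq. (1.6) into two parts, and obtain that
\[ v_T(s_1) = \max_{\alpha \in \Delta(A(s_1))} \sum_{a_1 \in A(s_1)} \alpha(a_1) \left( \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1) \max_{\sigma' \in \Sigma} \gamma_{T-1}(s_2; \sigma') \right). \tag{1.7} \]
Note that
\[ \max_{\sigma' \in \Sigma} \gamma_{T-1}(s_2; \sigma') = v_{T-1}(s_2), \quad \forall s_2 \in S; \]
hence, the right-hand side of Eq. (1.7) is equal to
\[ \max_{\alpha \in \Delta(A(s_1))} \left( \sum_{a_1 \in A(s_1)} \alpha(a_1) \left[ \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T} \sum_{s_2 \in S} q(s_2 \mid s_1, a_1)\, v_{T-1}(s_2) \right] \right). \]
The function within the parentheses is linear in $\alpha$, and $\Delta(A(s_1))$ is a compact set whose extreme points are the Dirac measures concentrated at the points $a_1$ with $a_1 \in A(s_1)$. A linear function that is defined on a compact set attains its maximum at an extreme point. The result follows.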
The proof of Theorem 1.17 yields an algorithm that calculates the $T$-stage value and a $T$-stage optimal strategy $\sigma^*$. We will calculate by induction a $k$-stage optimal strategy $\sigma^*_k$ for every $k = 1, 2, \ldots, T$. We start with $k = 1$, and calculate a one-stage optimal strategy for every initial state $s \in S$. Let $a^*_1(s) \in A(s)$ be an action that maximizes the quantity $r(s, a)$ over $a \in A(s)$, and set
\[ \sigma^*_1(s) := a^*_1(s). \]
The value of the one-stage problem with initial state $s$ is $v_1(s) = r(s, a^*_1(s))$.

We continue recursively. Suppose that, for every initial state $s$, we already calculated $v_{k-1}(s)$ and already defined a $(k-1)$-stage optimal strategy $\sigma^*_{k-1}$. To calculate $v_k(s)$ and define a $k$-stage optimal strategy $\sigma^*_k$, we take
\[ \max_{a \in A(s)} \left( \frac{1}{k} r(s, a) + \frac{k-1}{k} \sum_{s' \in S} q(s' \mid s, a)\, v_{k-1}(s') \right), \tag{1.8} \]
and denote by $a^*_k(s) \in A(s)$ an action that achieves the maximum in Eq. (1.8). This is the quantity on the right-hand side of Eq. (1.4); hence, it is equal to $v_k(s)$. We can now define an optimal strategy $\sigma^*$ for the decision maker as follows:
• At stage 1, play the action $a^*_k(s_1)$.
• From stage 2 on, follow the strategy $\sigma^*_{k-1}$; that is, at each stage $t$, when the current state is $s_t$ and $T - t + 1$ stages are left, play the action $a^*_{T-t+1}(s_t)$.
Formally,
\[ \sigma^*(h_t) := a^*_{T-t+1}(s_t), \quad \forall h_t \in H_t,\ 1 \leq t \leq T. \]
In Exercise 1.1, the reader is asked to prove that this strategy is indeed $T$-stage optimal.
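The recursion described above translates almost line by line into code. The following is a minimal sketch, assuming the recursion weights the current payoff by $1/k$ and the expected $(k-1)$-stage value by $(k-1)/k$, as in Eq. (1.8); the function name `backward_induction`, the convention $v_0 \equiv 0$, and the dictionary encoding of Example 1.3 are assumptions made for the illustration.

```python
from fractions import Fraction as F


def backward_induction(states, actions, q, r, T):
    """Return (v, a_star): v[k][s] is the k-stage value at s and
    a_star[k][s] is an action attaining the maximum in the recursion."""
    # Convention (assumption): the 0-stage value is 0; it gets weight 0 at k = 1.
    v = {0: {s: F(0) for s in states}}
    a_star = {}
    for k in range(1, T + 1):
        v[k], a_star[k] = {}, {}
        for s in states:
            best_a, best_val = None, None
            for a in actions[s]:
                cont = sum(p * v[k - 1][s2] for s2, p in q[(s, a)].items())
                val = F(1, k) * r[(s, a)] + F(k - 1, k) * cont
                if best_val is None or val > best_val:
                    best_a, best_val = a, val
            v[k][s], a_star[k][s] = best_val, best_a
    return v, a_star


# Data of Example 1.3.
STATES = ["s1", "s2", "s3"]
ACTIONS = {"s1": ["U", "D"], "s2": ["D"], "s3": ["D"]}
Q = {
    ("s1", "U"): {"s2": F(1)},
    ("s1", "D"): {"s1": F(1)},
    ("s2", "D"): {"s1": F(1, 10), "s3": F(9, 10)},
    ("s3", "D"): {"s3": F(1)},
}
R = {("s1", "U"): F(10), ("s1", "D"): F(5), ("s2", "D"): F(10), ("s3", "D"): F(-100)}

v, a_star = backward_induction(STATES, ACTIONS, Q, R, T=3)
```

In this example the three-stage value at $s(1)$ is obtained by playing $D$ first and $U$ in the second stage, which collects 5, 10, and 10 before the risky transition out of $s(2)$ can matter. The optimal action at stage $t$ of a $T$-stage problem is read off as `a_star[T - t + 1][s_t]`.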
The proof of Theorem 1.17 relies on the linearity of the payoff function: the goal of the decision maker is to maximize a linear function of the stage payoffs. If the sets of actions and states are not finite, the theorem still holds, provided that in Eq. (1.4) we replace maximum by supremum.

Theorem 1.17 admits the following corollary.

Theorem 1.18 The $T$-stage value always exists. Moreover, there exists an optimal pure strategy $\sigma \in \Sigma$.

One can show a stronger result concerning the structure of an optimal pure strategy: there exists an optimal pure strategy $\sigma$ with the property that $\sigma(h_t)$ depends on the current state $s_t$ and on the stage $t$, and is independent of the rest of the history $(s_1, a_1, \ldots, s_{t-1}, a_{t-1})$ (Exercise 1.3).
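1.4 The Discounted Payoff
The discounted payoff depends on a parameter $\lambda \in (0, 1]$, called the discount factor, which measures how money grows with time: one dollar today is worth $\frac{1}{1-\lambda}$ dollars tomorrow, $\frac{1}{(1-\lambda)^2}$ dollars the day after tomorrow, and so on. In other words, the decision maker is indifferent between getting $1 - \lambda$ dollars today and one dollar tomorrow.

Definition 1.19 For every discount factor $\lambda \in (0, 1]$, every state $s \in S$, and every strategy $\sigma \in \Sigma$, the $\lambda$-discounted payoff under strategy $\sigma$ at the initial state $s$ is
\[ \gamma_\lambda(s; \sigma) := \mathbf{E}_{s,\sigma}\left[ \lambda \sum_{t=1}^{\infty} (1-\lambda)^{t-1} r(s_t, a_t) \right]. \tag{1.9} \]

The $\lambda$ in Eq. (1.9) serves as a normalization factor: a player who receives one dollar at every stage evaluates this stream of payoffs as one dollar. Since there are finitely many states and actions, the payoff function $r$ is bounded, and therefore $\gamma_\lambda$ obeys the same bound (which is independent of $\lambda$, thanks to the multiplication by $\lambda$).

The dominated convergence theorem (see, e.g., Shiryaev (1995), theorem 6.3) implies that
\[ \gamma_\lambda(s_1; \sigma) = \lambda \sum_{t=1}^{\infty} (1-\lambda)^{t-1}\, \mathbf{E}_{s_1,\sigma}[r(s_t, a_t)]. \]
Simple algebraic manipulations yield
\[ \gamma_\lambda(s_1; \sigma) = \lambda\, \mathbf{E}_{s_1,\sigma}[r(s_1, a_1)] + (1-\lambda)\, \lambda \sum_{t=2}^{\infty} (1-\lambda)^{t-2}\, \mathbf{E}_{s_1,\sigma}[r(s_t, a_t)]. \tag{1.10} \]
For every two states $s, s' \in S$ and every action $a \in A(s)$, set
\[ \gamma_\lambda(s'; \sigma_{s,a}) := \mathbf{E}_{s,\sigma}\left[ \lambda \sum_{t=2}^{\infty} (1-\lambda)^{t-2} r(s_t, a_t) \,\Big|\, a_1 = a, s_2 = s' \right]. \]
This is the expected discounted payoff from stage 2 on, when conditioning on the history at stage 2. Alternatively, this is the expected discounted payoff when the initial state is $s'$, and the decision maker follows that part of her strategy that follows the history $(s, a)$. If $\sigma$ is a stationary strategy, then the way it plays after the first stage does not depend on the play in the first stage. Hence, in this case, for every two states $s, s' \in S$ and every action $a \in A(s)$ we have $\gamma_\lambda(s'; \sigma_{s,a}) = \gamma_\lambda(s'; \sigma)$.

From Eq. (1.10) we obtain:
\[ \gamma_\lambda(s_1; \sigma) = \mathbf{E}_{s_1,\sigma}\left[ \lambda\, r(s_1, a_1) + (1-\lambda)\, \gamma_\lambda(s_2; \sigma_{s_1,a_1}) \right]. \tag{1.11} \]
Thus, the expected payoff is a weighted average of the payoff $r(s_1, a_1)$ at the first stage and the expected payoff $\gamma_\lambda(s_2; \sigma_{s_1,a_1})$ in all subsequent stages. When the discount factor $\lambda$ is high, the weight of the first stage is high; whereas when the discount factor $\lambda$ is low, the weight of the first stage is low.

Eq. (1.11) illustrates that the decision maker's payoff consists of two parts: today's payoff and the future's payoff. The discount factor indicates the relative importance of each part. The lower the discount factor, the higher the importance of the future, and therefore the decision maker should put more weight on future opportunities. The higher the discount factor, the higher the importance of the present, and the decision maker should concentrate on short-term gains.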
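For a pure stationary strategy, the weighted-average relation between today's payoff and the continuation payoff can be iterated numerically until it stabilizes. The sketch below is illustrative and rests on one assumption about the relation: the discounted payoff equals $\lambda$ times the first-stage payoff plus $(1-\lambda)$ times the expected discounted payoff from the next state. The function name `discounted_payoff`, the tolerance, and the dictionary encoding of Example 1.3 are choices made for the illustration; iteration is used here instead of solving the linear system directly.

```python
def discounted_payoff(states, q, r, x, lam, tol=1e-12):
    """Approximate the lam-discounted payoff of the pure stationary strategy x.

    x maps each state to the action played there. The update is repeated
    until successive iterates differ by less than tol in every state.
    """
    gamma = {s: 0.0 for s in states}
    while True:
        new = {
            s: lam * r[(s, x[s])]
            + (1 - lam) * sum(p * gamma[s2] for s2, p in q[(s, x[s])].items())
            for s in states
        }
        if max(abs(new[s] - gamma[s]) for s in states) < tol:
            return new
        gamma = new


# Data of Example 1.3 (floating-point version).
STATES = ["s1", "s2", "s3"]
Q = {
    ("s1", "U"): {"s2": 1.0},
    ("s1", "D"): {"s1": 1.0},
    ("s2", "D"): {"s1": 0.1, "s3": 0.9},
    ("s3", "D"): {"s3": 1.0},
}
R = {("s1", "U"): 10.0, ("s1", "D"): 5.0, ("s2", "D"): 10.0, ("s3", "D"): -100.0}

always_D = {"s1": "D", "s2": "D", "s3": "D"}
always_U = {"s1": "U", "s2": "D", "s3": "D"}

g_D = discounted_payoff(STATES, Q, R, always_D, lam=0.5)
g_U = discounted_payoff(STATES, Q, R, always_U, lam=0.5)
```

The iteration converges because each update shrinks differences between iterates by the factor $1 - \lambda < 1$; for $\lambda$ close to 0 it needs many more rounds than for $\lambda$ close to 1.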
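[Figure 1.3 The Markov decision problem in Example 1.3.]

Comment 1.20 In the proof of Theorem 1.17 we in fact showed that the $T$-stage payoff satisfies the following formula:
\[ \gamma_T(s_1; \sigma) = \mathbf{E}_{s_1,\sigma}\left[ \frac{1}{T} r(s_1, a_1) + \frac{T-1}{T}\, \gamma_{T-1}(s_2; \sigma_{s_1,a_1}) \right]. \tag{1.12} \]
Thus, similar to the discounted payoff, the $T$-stage payoff is a weighted average of the payoff $r(s_1, a_1)$ at the first stage and the expected payoff $\gamma_{T-1}(s_2; \sigma_{s_1,a_1})$ in all subsequent stages, with weights $\frac{1}{T}$ and $\frac{T-1}{T}$.

Example 1.3, continued The Markov decision problem in Example 1.3 is reproduced in Figure 1.3. The initial state is $s(1)$. The strategy $\sigma_D$ that always plays $D$ at state $s(1)$ yields a payoff 5 at every stage, and therefore its $\lambda$-discounted payoff is 5 as well. Let us calculate the $\lambda$-discounted payoff of the strategy $\sigma_U$ that always plays $U$ at state $s(1)$. Since this strategy is stationary,
\[ \gamma_\lambda(s(1); \sigma_U) = 10\lambda + (1-\lambda)\left( 10\lambda + (1-\lambda)\left( \tfrac{1}{10}\, \gamma_\lambda(s(1); \sigma_U) + \tfrac{9}{10} \cdot (-100) \right) \right). \tag{1.13} \]
The term $\gamma_\lambda(s(1); \sigma_U)$ on the right-hand side is the discounted payoff from the third stage and on, if at the second stage the play moves from $s(2)$ to $s(1)$. Eq. (1.13) solves to
\[ \gamma_\lambda(s(1); \sigma_U) = \frac{10\lambda(2-\lambda) - 90(1-\lambda)^2}{1 - \tfrac{1}{10}(1-\lambda)^2}. \]
For $\lambda = 1$ (only the first day matters), we get $\gamma_1(s(1); \sigma_U) = 10$, while for $\lambda$ close to 0 (the far future matters), we get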