Model-Based Reinforcement Learning

IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board
Sarah Spurgeon, Editor in Chief

Jón Atli Benediktsson
Anjan Bose
Adam Drobot
Peter (Yong) Lian
Andreas Molisch
Saeid Nahavandi
Diomidis Spinellis
Ahmet Murat Tekalp
Jeffrey Reed
Thomas Robertazzi

Model-Based Reinforcement Learning
From Data to Continuous Actions with a Python-based Toolbox

Milad Farsi and Jun Liu
University of Waterloo, Ontario, Canada

IEEE Press Series on Control Systems Theory and Applications
Maria Domenica Di Benedetto, Series Editor

Copyright © 2023 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our website at www.wiley.com.

Library of Congress Cataloging-in-Publication Data applied for:
Hardback ISBN: 9781119808572

Cover Design: Wiley
Cover Images: © Pobytov/Getty Images; Login/Shutterstock; Sazhnieva Oksana/Shutterstock

Set in 9.5/12.5pt STIX Two Text by Straive, Chennai, India
Contents

About the Authors
Preface
Acronyms
Introduction

1 Nonlinear Systems Analysis
1.1 Notation
1.2 Nonlinear Dynamical Systems
1.2.1 Remarks on Existence, Uniqueness, and Continuation of Solutions
1.3 Lyapunov Analysis of Stability
1.4 Stability Analysis of Discrete Time Dynamical Systems
1.5 Summary
Bibliography

2 Optimal Control
2.1 Problem Formulation
2.2 Dynamic Programming
2.2.1 Principle of Optimality
2.2.2 Hamilton–Jacobi–Bellman Equation
2.2.3 A Sufficient Condition for Optimality
2.2.4 Infinite-Horizon Problems
2.3 Linear Quadratic Regulator
2.3.1 Differential Riccati Equation
2.3.2 Algebraic Riccati Equation
2.3.3 Convergence of Solutions to the Differential Riccati Equation
2.3.4 Forward Propagation of the Differential Riccati Equation for Linear Quadratic Regulator
2.4 Summary
Bibliography

3 Reinforcement Learning
3.1 Control-Affine Systems with Quadratic Costs
3.2 Exact Policy Iteration
3.2.1 Linear Quadratic Regulator
3.3 Policy Iteration with Unknown Dynamics and Function Approximations
3.3.1 Linear Quadratic Regulator with Unknown Dynamics
3.4 Summary
Bibliography

4 Learning of Dynamic Models
4.1 Introduction
4.1.1 Autonomous Systems
4.1.2 Control Systems
4.2 Model Selection
4.2.1 Gray-Box vs. Black-Box
4.2.2 Parametric vs. Nonparametric
4.3 Parametric Model
4.3.1 Model in Terms of Bases
4.3.2 Data Collection
4.3.3 Learning of Control Systems
4.4 Parametric Learning Algorithms
4.4.1 Least Squares
4.4.2 Recursive Least Squares
4.4.3 Gradient Descent
4.4.4 Sparse Regression
4.5 Persistence of Excitation
4.6 Python Toolbox
4.6.1 Configurations
4.6.2 Model Update
4.6.3 Model Validation
4.7 Comparison Results
4.7.1 Convergence of Parameters
4.7.2 Error Analysis
4.7.3 Runtime Results
4.8 Summary
Bibliography

5 Structured Online Learning-Based Control of Continuous-Time Nonlinear Systems
5.1 Introduction
5.2 A Structured Approximate Optimal Control Framework
5.3 Local Stability and Optimality Analysis
5.3.1 Linear Quadratic Regulator
5.3.2 SOL Control
5.4 SOL Algorithm
5.4.1 ODE Solver and Control Update
5.4.2 Identified Model Update
5.4.3 Database Update
5.4.4 Limitations and Implementation Considerations
5.4.5 Asymptotic Convergence with Approximate Dynamics
5.5 Simulation Results
5.5.1 Systems Identifiable in Terms of a Given Set of Bases
5.5.2 Systems to Be Approximated by a Given Set of Bases
5.5.3 Comparison Results
5.6 Summary
Bibliography

6 A Structured Online Learning Approach to Nonlinear Tracking with Unknown Dynamics
6.1 Introduction
6.2 A Structured Online Learning for Tracking Control
6.2.1 Stability and Optimality in the Linear Case
6.3 Learning-based Tracking Control Using SOL
6.4 Simulation Results
6.4.1 Tracking Control of the Pendulum
6.4.2 Synchronization of Chaotic Lorenz System
6.5 Summary
Bibliography

7 Piecewise Learning and Control with Stability Guarantees
7.1 Introduction
7.2 Problem Formulation
7.3 The Piecewise Learning and Control Framework
7.3.1 System Identification
7.3.2 Database
7.3.3 Feedback Control
7.4 Analysis of Uncertainty Bounds
7.4.1 Quadratic Programs for Bounding Errors
7.5 Stability Verification for Piecewise-Affine Learning and Control
7.5.1 Piecewise Affine Models
7.5.2 MIQP-based Stability Verification of PWA Systems
7.5.3 Convergence of ACCPM
7.6 Numerical Results
7.6.1 Pendulum System
7.6.2 Dynamic Vehicle System with Skidding
7.6.3 Comparison of Runtime Results
7.7 Summary
Bibliography

8 An Application to Solar Photovoltaic Systems
8.1 Introduction
8.2 Problem Statement
8.2.1 PV Array Model
8.2.2 DC-DC Boost Converter
8.3 Optimal Control of PV Array
8.3.1 Maximum Power Point Tracking Control
8.3.2 Reference Voltage Tracking Control
8.3.3 Piecewise Learning Control
8.4 Application Considerations
8.4.1 Partial Derivative Approximation Procedure
8.4.2 Partial Shading Effect
8.5 Simulation Results
8.5.1 Model and Control Verification
8.5.2 Comparative Results
8.5.3 Model-Free Approach Results
8.5.4 Piecewise Learning Results
8.5.5 Partial Shading Results
8.6 Summary
Bibliography

9 An Application to Low-level Control of Quadrotors
9.1 Introduction
9.2 Quadrotor Model
9.3 Structured Online Learning with RLS Identifier on Quadrotor
9.3.1 Learning Procedure
9.3.2 Asymptotic Convergence with Uncertain Dynamics
9.3.3 Computational Properties
9.4 Numerical Results
9.5 Summary
Bibliography

10 Python Toolbox
10.1 Overview
10.2 User Inputs
10.2.1 Process
10.2.2 Objective
10.3 SOL
10.3.1 Model Update
10.3.2 Database
10.3.3 Library
10.3.4 Control
10.4 Display and Outputs
10.4.1 Graphs and Printouts
10.4.2 3D Simulation
10.5 Summary
Bibliography

A Appendix
A.1 Supplementary Analysis of Remark 5.4
A.2 Supplementary Analysis of Remark 5.5

Index
About the Authors

Milad Farsi received a B.S. degree in Electrical Engineering (Electronics) from the University of Tabriz in 2010. He obtained an M.S. degree, also in Electrical Engineering (Control Systems), from Sahand University of Technology in 2013. He gained industrial experience as a Control System Engineer between 2012 and 2016. He received a Ph.D. degree in Applied Mathematics from the University of Waterloo, Canada, in 2022 and is currently a Postdoctoral Fellow at the same institution. His research interests include control systems, reinforcement learning, and their applications in robotics and power electronics.

Jun Liu received a B.S. degree in Applied Mathematics from Shanghai Jiao Tong University in 2002, an M.S. degree in Mathematics from Peking University in 2005, and a Ph.D. degree in Applied Mathematics from the University of Waterloo, Canada, in 2010. He is currently an Associate Professor of Applied Mathematics and a Canada Research Chair in Hybrid Systems and Control at the University of Waterloo, where he directs the Hybrid Systems Laboratory. From 2012 to 2015, he was a Lecturer in Control and Systems Engineering at the University of Sheffield. During 2011 and 2012, he was a Postdoctoral Scholar in Control and Dynamical Systems at the California Institute of Technology. His main research interests are in the theory and applications of hybrid systems and control, including rigorous computational methods for control design with applications in cyber-physical systems and robotics.
Preface

The subject of Reinforcement Learning (RL) is popularly associated with the psychology of animal learning through a trial-and-error mechanism. The underlying mathematical principle of RL techniques, however, is undeniably the theory of optimal control, as exemplified by landmark results in the late 1950s on dynamic programming by Bellman, the maximum principle by Pontryagin, and the Linear Quadratic Regulator (LQR) by Kalman. Optimal control itself has its roots in the much older subject of calculus of variations, which dates back to the late 1600s. Pontryagin's maximum principle and the Hamilton–Jacobi–Bellman (HJB) equation are the two main pillars of optimal control, the latter of which provides feedback control strategies through an optimal value function, whereas the former characterizes open-loop control signals.

Reinforcement learning was developed by Barto and Sutton in the 1980s, inspired by animal learning and behavioral psychology. The subject has experienced a resurgence of interest in both academia and industry over the past decade, amid the new explosive wave of AI and machine learning research. A notable recent success of RL was in tackling the otherwise seemingly intractable game of Go and defeating the world champion in 2016.

Arguably, the problems originally solved by RL techniques are mostly discrete in nature: for example, navigating mazes and playing video games, where both the states and actions are discrete (finite), or simple control tasks such as pole balancing with impulsive forces, where the actions (controls) are chosen to be discrete. More recently, researchers have started to investigate RL methods for problems with both continuous state and action spaces. On the other hand, classical optimal control problems by definition have continuous state and control variables. It seems natural to simply formulate optimal control problems in a more general way and develop RL techniques to solve them. Nonetheless, there are two main challenges in solving such optimal control problems from a computational perspective. First, most techniques require exact or at least approximate model information. Second, the computation of optimal value functions and feedback controls often suffers from the curse of dimensionality. As a result, such methods are often too slow to be applied in an online fashion.

The book was motivated by this very challenge of developing computationally efficient methods for online learning of feedback controllers for continuous control problems. A main part of this book was based on the PhD thesis of the first author, which presented a Structured Online Learning (SOL) framework for computing feedback controllers by forward integration of a state-dependent differential Riccati equation along state trajectories. In the special case of Linear Time-Invariant (LTI) systems, this reduces to solving the well-known LQR problem without prior knowledge of the model. The first part of the book (Chapters 1–3) provides some background materials, including Lyapunov stability analysis, optimal control, and RL for continuous control problems. The remaining part (Chapters 4–9) discusses the SOL framework in detail, covering both regulation and tracking problems, their further extensions, and various case studies.

The first author would like to convey his heartfelt thanks to those who encouraged and supported him during his research. The second author is grateful to the mentors, students, colleagues, and collaborators who have supported him throughout his career. We gratefully acknowledge financial support for the research through the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program, and the Ontario Early Researcher Award Program.

Waterloo, Ontario, Canada
April 2022

Milad Farsi and Jun Liu
Acronyms

ACCPM analytic center cutting-plane method
ARE algebraic Riccati equation
DNN deep neural network
DP dynamic programming
DRE differential Riccati equation
FPRE forward-propagating Riccati equation
GD gradient descent
GUAS globally uniformly asymptotically stable
GUES globally uniformly exponentially stable
HJB Hamilton–Jacobi–Bellman
KDE kernel density estimation
LMS least mean squares
LQR linear quadratic regulator
LQT linear quadratic tracking
LS least squares
LTI linear time-invariant
MBRL model-based reinforcement learning
MDP Markov decision process
MIQP mixed-integer quadratic program
MPC model predictive control
MPP maximum power point
MPPT maximum power point tracking
NN neural network
ODE ordinary differential equation
PDE partial differential equation
PE persistence of excitation
PI policy iteration
PV photovoltaic
PWA piecewise affine
PWM pulse-width modulation
RL reinforcement learning
RLS recursive least squares
RMSE root mean square error
ROA region of attraction
SDRE state-dependent Riccati equation
SINDy sparse identification of nonlinear dynamics
SMC sliding mode control
SOL structured online learning
SOS sum of squares
TD temporal difference
UAS uniformly asymptotically stable
UES uniformly exponentially stable
VI value iteration
I.1 Background and Motivation

I.1.1 Lack of an Efficient General Nonlinear Optimal Control Technique
Optimal control theory plays an important role in designing effective control systems. For linear systems, a class of optimal control problems has been solved successfully under the framework of the Linear Quadratic Regulator (LQR). LQR problems are concerned with minimizing a quadratic cost for linear systems in terms of the control input and state; solving them allows us to regulate the state and the control input of the system. In control applications, this provides an opportunity to shape the behavior of the system by adjusting the weighting coefficients used in the cost functional. However, when it comes to nonlinear dynamical systems, there is no systematic method for efficiently obtaining an optimal feedback control for general nonlinear systems. Thus, many of the techniques available in the literature on linear systems do not apply in general.
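For reference, in its standard continuous-time form, with positive (semi)definite weighting matrices Q and R chosen by the designer, the LQR problem reads

\[
\min_{u(\cdot)} \int_0^\infty \left( x^\top Q x + u^\top R u \right) \mathrm{d}t
\quad \text{subject to} \quad \dot{x} = A x + B u,
\]

and its solution is a linear state feedback \(u = -Kx\), which is what makes the closed-loop behavior tunable through the weights Q and R.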
Despite the complexity of nonlinear dynamical systems, they have attracted much attention from researchers in recent years. This is mostly because of their practical benefits in a wide variety of engineering applications, including power electronics, flight control, and robotics, among many others. For the control of a general nonlinear dynamical system, optimal control involves finding a control input that minimizes a cost functional depending on the controlled state trajectory and the control input. While such a problem formulation can cover a wide range of applications, how to efficiently solve such problems remains a topic of active research.
I.1.2 Importance of an Optimal Feedback Control
In general, there exist two well-known approaches to solving such optimal control problems: the maximum (or minimum) principles [Pontryagin, 1987] and the Dynamic Programming (DP) method [Bellman and Dreyfus, 1962]. To solve an optimization problem that involves dynamics, maximum principles require us to solve a two-point boundary value problem, where the solution is not in a feedback form.
Plenty of numerical techniques have been presented in the literature to solve the optimal control problem. Such approaches generally rely on knowledge of the exact model of the system. In the case where such a model exists, the optimal control input is obtained in open-loop form as a time-dependent signal. Consequently, implementing these approaches in real-world problems often involves many complications that are well known to the control community. This is because model mismatch, noise, and disturbances greatly affect the online solution, causing it to diverge from the preplanned offline solution. Therefore, obtaining a closed-loop solution of the optimal control problem is often preferred in such applications.
The DP approach analytically results in a feedback control for linear systems with a quadratic cost. Moreover, employing the Hamilton–Jacobi–Bellman (HJB) equation with a value function, one might manage to derive an optimal feedback control rule for some real-world applications, provided that the value function can be updated in an efficient manner. This motivates us to consider conditions leading to an optimal feedback control rule that can be efficiently implemented in real-world problems.
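For reference, for dynamics \(\dot{x} = f(x, u)\) and running cost \(l(x, u)\), the infinite-horizon HJB equation determines the optimal value function V and, through it, a feedback law:

\[
0 = \min_{u}\left[\, l(x, u) + \nabla V(x)^\top f(x, u) \,\right],
\qquad
u^*(x) = \arg\min_{u}\left[\, l(x, u) + \nabla V(x)^\top f(x, u) \,\right].
\]

The difficulty alluded to above is that V rarely admits a closed form and must be computed, or approximated, efficiently.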
I.1.3 Limits of Optimal Feedback Control Techniques
Consider an optimal control problem over an infinite horizon involving a nonquadratic performance measure. Using the idea of inverse optimal control, the cost functional can then be evaluated in closed form as long as the running cost depends in a suitable way on an underlying Lyapunov function by which the asymptotic stability of the nonlinear closed-loop system is guaranteed. It then follows that the Lyapunov function is indeed the solution of the steady-state HJB equation. Although such a formulation allows one to obtain an optimal feedback rule analytically, choosing the proper performance measure may not be trivial. Moreover, from a practical point of view, the nonlinearity in the performance measure might cause unpredictable behavior.
A well-studied method for solving an optimal control problem online is to employ a value function for a given policy. For any state, the value function gives a measure of how good the state is by accumulating the cost incurred from that state onward while the policy is applied. If such a value function can be obtained, and the system model is known, the optimal policy is the one that takes the system in the direction along which the value decreases the most in the state space. Such Reinforcement Learning (RL) techniques, known as value-based methods and including the Value Iteration (VI) and Policy Iteration (PI) algorithms, have been shown to be effective for finite state and control spaces. However, the computations cannot efficiently scale with the size of the state and control spaces.
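Concretely, for a policy \(u = \pi(x)\), the value function referred to here accumulates the running cost along the closed-loop trajectory from a given initial state:

\[
V^{\pi}(x_0) = \int_0^\infty l\big(x(t), \pi(x(t))\big)\,\mathrm{d}t,
\qquad \dot{x} = f\big(x, \pi(x)\big),\quad x(0) = x_0.
\]

PI alternates between evaluating \(V^{\pi}\) and improving \(\pi\) greedily with respect to it, while VI updates the value estimate directly.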
I.1.4 Complexity of Approximate DP Algorithms
One way of facilitating the computation of the value updates is to employ an approximation scheme. This is done by parameterizing the value function and adjusting the parameters in the training process. The optimal policy given by the value function is then also parameterized and approximated accordingly. The complexity of any value update depends directly on the number of parameters employed, and one may try to limit the number of parameters at the expense of optimality. We are therefore motivated to obtain a more efficient update rule for the value parameters, rather than limiting their number. We achieve this by reformulating the problem with a quadratically parameterized value function.
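For instance, with a vector of basis functions \(\Phi(x)\), a quadratically parameterized value function takes the form

\[
V(x) = \Phi(x)^\top P\,\Phi(x),
\]

where the symmetric matrix P collects the parameters. This is the structure that leads to the Riccati-type update rules for P developed later in the book.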
Moreover, the classical VI algorithm does not explicitly use the system model for evaluating the policy. This benefits applications in that full knowledge of the system dynamics is no longer required. However, online training with VI alone may take much longer to converge, since the model participates only implicitly through the future state. Therefore, the learning process can potentially be accelerated by introducing the system model. Furthermore, this creates an opportunity for running a separate identifier unit, where the model obtained can be simulated offline to complete the training or can be used for learning optimal policies for different objectives.
It can be shown that the VI algorithm for linear systems results in a Lyapunov recursion in the policy evaluation step. Such a Lyapunov equation in terms of the system matrices can be solved efficiently. For the general nonlinear case, however, no equivalent formulation is known that admits an efficient solution. Hence, we are motivated to investigate the possibility of acquiring an efficient update rule for nonlinear systems.
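To sketch that observation, for a discrete-time linear system \(x^+ = Ax + Bu\) with quadratic cost and the current linear policy \(u = -Kx\), the evaluation step reduces to the Lyapunov recursion

\[
P_{k+1} = Q + K^\top R K + (A - BK)^\top P_k (A - BK),
\]

which involves only the system matrices and is cheap to iterate; no comparably simple recursion is available for general nonlinear dynamics.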
I.1.5 Importance of Learning-based Tracking Approaches
One of the most common problems in the control of dynamical systems is to track a desired reference trajectory, a task found in a variety of real-world applications. However, designing an efficient tracking controller using conventional methods often necessitates a thorough understanding of the model, as well as computations and considerations specific to each application. RL approaches, on the other hand, offer a more flexible framework that requires less information about the system dynamics. While this may create additional problems, such as safety or computational limits, there are already effective outcomes from the use of such approaches in real-world situations. Similar to regulation problems, tracking control applications can benefit from Model-based Reinforcement Learning (MBRL), which can handle the parameter updates more efficiently.
I.1.6 Opportunities for Obtaining a Real-time Control
In the approximate optimal control technique, employing a limited number of parameters can only yield a local approximation of the model and the value function. If an approximation over a larger domain is intended, a considerably higher number of parameters may be needed. As a result, the complexity of the identification and the controller might be too high for online implementation in real-world applications. This motivates us to circumvent this constraint by instead considering a set of simple local learners in a piecewise approach.
As mentioned, there already exist interesting real-world applications of MBRL. Motivated by this, in this monograph, we aim to introduce automated ways of solving optimal control problems that can replace conventional controllers. Hence, detailed applications of the proposed approaches are included, which are demonstrated with numerical simulations.
I.1.7 Summary

The main motivation for this monograph can be summarized as follows:

● Optimal control is highly favored, while there is no general analytical technique applicable to all nonlinear systems.
● Feedback control techniques are known to be more robust and computationally efficient compared to numerical techniques, especially in the continuous space.
● The chance of obtaining a feedback control in closed form is low, and the known techniques are limited to some special classes of systems.
● Approximate DP provides a systematic way of obtaining an optimal feedback control, while the complexity grows significantly with the number of parameters.
● An efficient parameterization of the optimal value may provide an opportunity for more complex real-time applications in control regulation and tracking problems.
I.1.8 Outline of the Book

We summarize the main contents of the book as follows:

● Chapter 1 introduces Lyapunov stability analysis of nonlinear systems, which is used in subsequent chapters for analyzing the closed-loop performance of the feedback controllers.
● Chapter 2 formulates the optimal control problem and introduces the basic concepts of using the HJB equation to characterize optimal feedback controllers, where LQR is treated as a special case. A focus is on optimal feedback controllers for asymptotic stabilization tasks with an infinite-horizon performance criterion.
● Chapter 3 discusses PI as a prominent RL technique for solving continuous optimal control problems. PI algorithms for both linear and nonlinear systems, with and without any knowledge of the system model, are discussed. Proofs of convergence and stability analysis are provided in a self-contained manner.
● Chapter 4 presents different techniques for learning a dynamic model for continuous control in terms of a set of basis functions, including least squares, recursive least squares, gradient descent, and sparse identification techniques for parameter updates. Comparison results are shown using numerical examples.
● Chapter 5 introduces the Structured Online Learning (SOL) framework for control, including the algorithm and local analysis of stability and optimality. The focus is on regulation problems.
● Chapter 6 extends the SOL framework to tracking with unknown dynamics. Simulation results are given to show the effectiveness of the SOL approach. Numerical results on comparison with alternative RL approaches are also shown.
● Chapter 7 presents a piecewise learning framework as a further extension of the SOL approach, where we limit the learners to linear bases while allowing models to be learned in a piecewise fashion. Accordingly, closed-loop stability guarantees are provided with Lyapunov analysis facilitated by Mixed-Integer Quadratic Program (MIQP)-based verification.
● Chapters 8 and 9 present two case studies on Photovoltaic (PV) and quadrotor systems. Chapter 10 introduces the associated Python-based tool for SOL.

It should be noted that some of the contents of Chapters 5–9 have been previously published in Farsi and Liu [2020, 2021], Farsi et al. [2022], and Farsi and Liu [2022b, 2019], and they are included in this book with the permission of the cited publishers.
I.2 Literature Review

I.2.1 Reinforcement Learning
RL is a well-known class of machine learning methods concerned with learning to achieve a particular task through interactions with the environment. The task is often defined by some reward mechanism, and the intelligent agent has to take actions in different situations. The reward accumulated is then used as a measure to improve the agent's actions in the future, where the objective is to accumulate as much reward as possible over time. Therefore, it is expected that the agent's actions approach the optimal behavior in the long term. RL has achieved considerable success in simulation environments. However, the lack of explainability [Dulac-Arnold et al., 2019] and data efficiency [Duan et al., 2016] makes these methods less favorable as an online learning technique that can be directly employed in real-world problems, unless there exists a way to safely transfer the experience from simulation-based learning to the real world. The main challenges in the implementation of RL techniques are discussed in Dulac-Arnold et al. [2019]. Numerous studies have been done on this subject; see, e.g., Sutton and Barto [2018], Wiering and Van Otterlo [2012], Kaelbling et al. [1996], and Arulkumaran et al. [2017] for a list of related works. RL has found a variety of interesting applications in robotics [Kober et al., 2013], multiagent systems [Zhang et al., 2021; Da Silva and Costa, 2019; Hernandez-Leal et al., 2019], power systems [Zhang et al., 2019; Yang et al., 2020], autonomous driving [Kiran et al., 2021] and intelligent transportation [Haydari and Yilmaz, 2020], and healthcare [Yu et al., 2021], among others.
I.2.2 Model-based Reinforcement Learning

MBRL techniques, as opposed to model-free methods in learning, are known to be more data efficient. Direct model-free methods usually require enormous amounts of data and hours of training even for simple applications [Duan et al., 2016], while model-based techniques can show optimal behavior in a limited number of trials. This property, in addition to the flexibility in changing learning objectives and performing further safety analysis, makes them more suitable for real-world implementations, such as robotics [Polydoros and Nalpantidis, 2017]. In model-based approaches, having a deterministic or probabilistic description of the transition system saves much of the effort spent by direct methods in treating any point in the state-control space individually. Hence, the role of model-based techniques becomes even more significant when it comes to problems with continuous controls rather than discrete actions [Sutton, 1990; Atkeson and Santamaria, 1997; Powell, 2004].

In Moerland et al. [2020], the authors provide a survey of some recent MBRL methods that are formulated based on Markov Decision Processes (MDPs). In general, there exist two approaches to approximating a system: parametric and nonparametric. Parametric models are usually preferred over nonparametric ones, since the number of parameters is independent of the number of samples. Therefore, they can be implemented more efficiently on complex systems, where many samples are needed. In nonparametric approaches, on the other hand, the prediction for a given sample is obtained by comparing it with a set of samples already stored, which represent the model; hence, the complexity increases with the size of the dataset. In this book, because of this advantage of parametric models, we focus on parametric techniques.
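In the parametric setting adopted in this book, the model is typically expressed as a linear combination of a fixed set of p basis functions, for example

\[
\hat{f}(x) = \Theta\,\Phi(x) = \sum_{i=1}^{p} \theta_i\, \phi_i(x),
\]

so the number of parameters is fixed by the choice of bases rather than growing with the number of samples.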
I.2.3 Optimal Control
Let us specifically consider implementations of RL on control systems. Although RL techniques do not require the dynamical model to solve the problem, they are in fact intended to find a solution to the optimal control problem, which has been extensively investigated by the control community. The LQR problem has been solved satisfactorily for linear systems using Riccati equations [Kalman, 1960], which also ensures system stability for infinite-horizon problems. However, in the case of nonlinear systems, obtaining such a solution is not trivial and requires us to solve the HJB equation, either analytically or numerically, which is a challenging task, especially when we do not have knowledge of the system model.
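For reference, the LQR solution mentioned here: for \(\dot{x} = Ax + Bu\) with cost \(\int_0^\infty (x^\top Q x + u^\top R u)\,\mathrm{d}t\), the optimal feedback is \(u = -R^{-1}B^\top P x\), where P is the stabilizing solution of the Algebraic Riccati Equation

\[
A^\top P + P A - P B R^{-1} B^\top P + Q = 0.
\]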
Model Predictive Control (MPC) [Camacho and Alba, 2013; Garcia et al., 1989; Qin and Badgwell, 2003; Grüne and Pannek, 2017; Mayne and Michalska, 1988; Morari and Lee, 1999] has been frequently used as an optimal control technique, which is inherently model-based. Furthermore, it deals with the control problem only across a restricted prediction horizon. For this reason, and because the problem is not considered in closed-loop form, stability analysis is hard to establish. For the same reasons, the online computational complexity is considerably high compared to a feedback control rule that can be efficiently implemented.
The Forward-Propagating Riccati Equation (FPRE) [Weiss et al., 2012; Prach et al., 2015] is one of the techniques presented for solving the LQR problem. Normally, the Differential Riccati Equation (DRE) is solved backward from a final condition. In an analogous technique, it can instead be solved forward in time from some initial condition. A comparison between these two schemes is given in Prach et al. [2015]. Employing forward-integration methods makes the technique suitable for solving the problem for time-varying systems [Weiss et al., 2012; Chen and Kao, 1997] or in the RL setting [Lewis et al., 2012], since the future dynamics are not needed, whereas the backward technique requires knowledge of the future dynamics from the final condition. FPRE has been shown to be an efficient technique for finding a suboptimal solution for linear systems; for nonlinear systems, the assumption is that the system is linearized along the system's trajectories.
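To make the contrast explicit: the finite-horizon LQR solution propagates the DRE backward in time from a terminal condition P(T),

\[
-\dot{P} = A^\top P + P A - P B R^{-1} B^\top P + Q,
\]

whereas FPRE integrates an equation with the same right-hand side forward in time from an initial condition P(0), so that no knowledge of the future dynamics is required.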
The State-Dependent Riccati Equation (SDRE) [Çimen, 2008; Erdem and Alleyne, 2004; Cloutier, 1997] is another technique that can be found in the literature for solving the optimal control problem for nonlinear systems. This technique relies on the fact that any nonlinear system can be written in the form of a linear system with state-dependent matrices. However, this conversion is not unique; hence, a suboptimal solution is expected. Similar to MPC, it does not yield a feedback control rule, since the control at each state is computed by solving a Riccati equation that depends on the system's trajectory.
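The starting point of this technique is the (nonunique) state-dependent factorization

\[
\dot{x} = f(x) + g(x)\,u = A(x)\,x + B(x)\,u,
\]

after which a Riccati equation with the matrices frozen at the current state is solved along the trajectory to compute the control.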
I.2.4 Dynamic Programming
Other model-based approaches can be found in the literature that are mainly categorized under RL in two groups: value function and policy search methods. In value function-based methods, also known as approximate/adaptive DP techniques [Wang et al., 2009; Lewis and Vrabie, 2009; Balakrishnan et al., 2008], a value function is used to construct the policy. Policy search methods, on the other hand, directly improve the policy to achieve optimality. Adaptive DP has found different applications [Prokhorov, 2008; Ferrari-Trecate et al., 2003; Prokhorov et al., 1995; Murray et al., 2002; Yu et al., 2014; Han and Balakrishnan, 2002; Lendaris et al., 2000; Liu and Balakrishnan, 2000] in automotive control, flight control, and power control, among others. A review of recent techniques can be found in Kalyanakrishnan and Stone [2009], Busoniu et al. [2017], Recht [2019], Polydoros and Nalpantidis [2017], and Kamalapurkar et al. [2018]. The Q-learning approach learns an action-dependent function using Temporal Difference (TD) to obtain the optimal policy. This is inherently a discrete approach. There are continuous extensions of this technique, such as Millán et al. [2002], Gaskett et al. [1999], Ryu et al. [2019], and Wei et al. [2018]. However, for an efficient implementation, the state and action spaces ought to be finite, which is highly restrictive for continuous problems.
Adaptive controllers [Åström and Wittenmark, 2013], as a well-known class of control techniques, may seem similar to RL in methodology, while there are substantial differences in the problem formulation and objectives. Adaptive techniques, as well as RL, learn to regulate unknown systems utilizing data collected in real time. In fact, an RL technique can be seen as an adaptive technique that converges to the optimal control [Lewis and Vrabie, 2009]. However, as opposed to RL and optimal controllers, adaptive controllers are not normally intended to be optimal with respect to a user-specified cost function. Hence, we will not draw direct comparisons with such methods.
Value methods in RL normally require solving the well-known HJB equation. However, common techniques for solving such equations suffer from the curse of dimensionality. Hence, in approximate DP techniques, a parametric or nonparametric model is used to approximate the solution. In Lewis and Vrabie [2009], some related approaches are reviewed that fundamentally follow the actor-critic structure [Barto et al., 1983], such as VI and PI algorithms.
In such approaches, the Bellman error, which is obtained from exploration of the state space, is used to improve the parameters estimated in a gradient-descent or least-squares loop that requires the Persistence of Excitation (PE) condition. Since the Bellman error obtained is only valid along the trajectories of the system, sufficient exploration of the state space is required to efficiently estimate the parameters. In Kamalapurkar et al. [2018], the authors have reviewed different strategies employed to increase the data efficiency of exploration. In Vamvoudakis et al. [2012] and Modares et al. [2014], a probing signal is added to the control to enhance the exploring properties of the policy. In another approach [Modares et al., 2014], the recorded data of explorations are used as a replay of experience to increase the data efficiency. Accordingly, the model obtained from identification is used to acquire more experience by doing simulation in an offline routine, which decreases the need for visiting every point in the state space.
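As a minimal illustration of such a loop, the following sketch evaluates a fixed policy by gradient descent on the continuous-time Bellman error for a scalar system, with a value function that is linear in a chosen set of bases. All of the ingredients here (dynamics, policy, bases, step size) are hypothetical placeholders chosen for illustration, not the toolbox API of this book.

import numpy as np

# Scalar system x_dot = f(x, u), quadratic running cost, and a fixed policy.
f = lambda x, u: -x + u
l = lambda x, u: x**2 + u**2
pi = lambda x: -0.5 * x

# Value approximation V(x) = w @ phi(x), with gradients of the bases.
phi = lambda x: np.array([x**2, x**4])
dphi = lambda x: np.array([2 * x, 4 * x**3])

w = np.zeros(2)      # value parameters
alpha = 0.5          # normalized step size

rng = np.random.default_rng(0)
for _ in range(20000):
    x = rng.uniform(-2.0, 2.0)   # random states stand in for exploration
    u = pi(x)
    # Continuous-time Bellman error along the policy:
    # delta = l(x, u) + dV/dx * f(x, u), which is zero at the true value.
    delta = l(x, u) + (w @ dphi(x)) * f(x, u)
    # Normalized gradient step on 0.5 * delta**2 with respect to w.
    g = dphi(x) * f(x, u)
    w -= alpha * delta * g / (1.0 + g @ g)

print(w)  # approaches [5/12, 0]: here the closed loop has V(x) = (5/12) x**2

The random state samples play the role of exploration: if the sampled data are not sufficiently rich, the regressors fail the PE condition and the parameters need not converge.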
As an alternative method, considering a nonlinear control-affine system with a known input coupling function, the work Kamalapurkar et al. [2016b] used a parametric model to approximate the value function. The authors then employed a least-squares minimization technique to adjust the parameters according to the Bellman error, which can be calculated at any arbitrary point of the state space using the identified internal dynamics of the system and approximated state derivatives, under a PE-like rank condition. In Kamalapurkar et al. [2016a], the authors proposed an improved technique, which approximates the value function only in a small neighborhood of the current state. It has been shown that the local approximation can be done more efficiently, since considerably fewer bases can be used.
In the works by Jiang and Jiang [2012, 2014, 2017], the authors proposed PI-based algorithms that do not require any prior knowledge of the system model. A similar PE-like rank condition was used to ensure sufficient exploration for successful learning of the value functions and controllers. It is shown that these algorithms can achieve semiglobal stabilization and convergence to optimal values and controllers. One of the main limitations of PI-based algorithms is that an initial stabilizing controller has to be provided. While convergent PI algorithms have recently been proved for discrete-time systems in more general settings [Bertsekas, 2017], their extension to continuous-time systems involves substantial technical difficulties, as pointed out by Bertsekas [2017].
I.2.5 Piecewise Learning
There exist different techniques to efficiently fit a piecewise model to data; see, e.g., Toriello and Vielma [2012], Breschi et al. [2016], Ferrari-Trecate et al. [2003], Amaldi et al. [2016], Rebennack and Krasko [2020], and Du et al. [2021]. In Ferrari-Trecate et al. [2003], a technique for the identification of discrete-time hybrid systems by the piecewise affine model is presented. The algorithm combines clustering, linear identification, and pattern recognition approaches to identify the affine subsystems together with the partitions for which they apply. In fact, the problem of globally fitting a piecewise affine model is considered computationally expensive to solve. In Lauer [2015], it is discussed that global optimality can be reached with a complexity that is polynomial in the number of data but exponential in the data dimension. In this regard, the work by Breschi et al. [2016] presents an efficient two-step technique: first, recursive clustering of the regressor vectors and estimation of the model parameters; second, computation of a polyhedral partition. A review of some of the techniques can be found in Gambella et al. [2021] and Garulli et al. [2012].
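Throughout this discussion, a piecewise affine model over a polyhedral partition \(\{\mathcal{P}_i\}_{i=1}^{M}\) of the domain has the form

\[
\hat{f}(x) = A_i x + c_i \quad \text{for } x \in \mathcal{P}_i,
\]

so identification must estimate both the local parameters \((A_i, c_i)\) and the partition itself, which is the source of the computational difficulty noted above.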
The flexibility of Piecewise Affine (PWA) systems makes them suitable for different approaches in control. Hence, the control problem of piecewise systems is extensively studied in the literature; see, e.g., Marcucci and Tedrake [2019], Zou and Li [2007], Rodrigues and Boyd [2005], Baotic [2005], Strijbosch et al. [2020], Christophersen et al. [2005], and Rodrigues and How [2003]. Moreover, various applications can be found for PWA systems, including robotics [Andrikopoulos et al., 2013; Marcucci et al., 2017], automotive control [Borrelli et al., 2006; Sun et al., 2019], and power electronics [Geyer et al., 2008; Vlad et al., 2012]. In Zou and Li [2007], the robust MPC strategy is extended to PWA systems with polytopic uncertainty, where multiple PWA quadratic Lyapunov functions are employed for different vertices of the uncertainty polytope in different partitions. In another work by Marcucci and Tedrake [2019], hybrid MPC is formulated as a mixed-integer program to solve the optimal control problem for PWA systems. However, these techniques are only available in an open-loop form, which decreases their applicability for real-time control.
On the other hand, Deep Neural Networks (DNNs) offer an efficient technique for control in closed loop. However, one drawback of DNN-based control is the difficulty in stability analysis. This becomes even more challenging when PWA systems are considered. The work by Chen et al. [2020] suggested a sample-efficient technique for synthesizing a Lyapunov function for the PWA system controlled through a DNN in closed loop. In this approach, the Analytic Center Cutting-Plane Method (ACCPM) [Goffin and Vial, 1993; Nesterov, 1995; Boyd and Vandenberghe, 2004] is first used to search for a Lyapunov function. Then, this Lyapunov function candidate is verified on the closed-loop system using an MIQP. This approach relies on knowledge of the exact model of the system; hence, it cannot be directly implemented on an identified PWA system with uncertainty.
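Schematically, the verification step poses the Lyapunov decrease condition as an optimization over the closed-loop PWA map \(f_{\mathrm{cl}}\): a candidate V is certified if

\[
\max_{x \ne 0}\; V\big(f_{\mathrm{cl}}(x)\big) - V(x) < 0,
\]

where encoding the polyhedral partition with binary variables turns this check into an MIQP; if instead the optimum is nonnegative, the maximizing state serves as a counterexample that ACCPM uses to cut down the set of Lyapunov candidates.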
I.2.6 Tracking Control
For the learning-based tracking problem, several techniques can be found in the literature, reviewed in Modares and Lewis [2014], Modares et al. [2015], Zhu et al. [2016], Yang et al. [2016], and Luo et al. [2016], in addition to some extensions presented for the techniques discussed earlier. Modares and Lewis [2014] developed an integral RL technique for linear systems based on the PI algorithm, starting with an admissible initial controller. It has been shown that the optimal tracking controller converges to the Linear Quadratic Tracking (LQT) controller, with a partially unknown system. In Modares et al. [2015], an off-policy method is employed with three neural networks in an actor-critic-disturbance configuration to learn an H∞-tracking controller for unknown nonlinear systems. Zhu et al. [2016] constructed an augmented system using the tracking error and the reference; neural networks were employed, in an actor-critic structure, to approximate the value function and learn an optimal policy. In another neural network-based approach [Yang et al., 2016], a single network was used to approximate the value function, where classes of uncertain dynamics were assumed. In addition to the above approaches, other similar ones exist in the literature. However, applications of RL in tracking control are not limited to model-based techniques. For instance, Luo et al. [2016] suggest a critic-only Q-learning approach for tracking problems, which does not require solving the HJB equation.
I.2.7 Applications
As mentioned, there exist different applications of MBRL, as well as of optimal control, to real-world problems [Prokhorov, 2008; Ferrari-Trecate et al., 2003; Prokhorov et al., 1995; Murray et al., 2002; Yu et al., 2014; Han and Balakrishnan, 2002; Lendaris et al., 2000; Liu and Balakrishnan, 2000]. Accordingly, we will later provide a detailed literature review for each of the applications, including the quadrotor and solar PV systems, in Chapters 9 and 8, respectively.
Bibliography

Edoardo Amaldi, Stefano Coniglio, and Leonardo Taccari. Discrete optimization methods to fit piecewise affine models to data points. Computers & Operations Research, 75:214–230, 2016.

George Andrikopoulos, George Nikolakopoulos, Ioannis Arvanitakis, and Stamatis Manesis. Piecewise affine modeling and constrained optimal control for a pneumatic artificial muscle. IEEE Transactions on Industrial Electronics, 61(2):904–916, 2013.

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.

Karl J. Åström and Björn Wittenmark. Adaptive Control. Courier Corporation, 2013.

Christopher G. Atkeson and Juan Carlos Santamaria. A comparison of direct and model-based reinforcement learning. In Proceedings of the International Conference on Robotics and Automation, volume 4, pages 3557–3564. IEEE, 1997.

S. N. Balakrishnan, Jie Ding, and Frank L. Lewis. Issues on stability of ADP feedback controllers for dynamical systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(4):913–917, 2008.