Model-Based Reinforcement Learning

IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board
Sarah Spurgeon, Editor in Chief

Jón Atli Benediktsson
Anjan Bose
Adam Drobot
Peter (Yong) Lian
Andreas Molisch
Saeid Nahavandi
Diomidis Spinellis
Ahmet Murat Tekalp
Jeffrey Reed
Thomas Robertazzi

Model-Based Reinforcement Learning
From Data to Continuous Actions with a Python-based Toolbox

Milad Farsi and Jun Liu
University of Waterloo, Ontario, Canada

IEEE Press Series on Control Systems Theory and Applications
Maria Domenica Di Benedetto, Series Editor

Copyright © 2023 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our website at www.wiley.com.

Library of Congress Cataloging-in-Publication Data applied for:
Hardback ISBN: 9781119808572

Cover Design: Wiley
Cover Images: © Pobytov/Getty Images; Login/Shutterstock; Sazhnieva Oksana/Shutterstock

Set in 9.5/12.5pt STIX Two Text by Straive, Chennai, India
Contents

About the Authors
Preface
Acronyms
Introduction

1 Nonlinear Systems Analysis
1.1 Notation
1.2 Nonlinear Dynamical Systems
1.2.1 Remarks on Existence, Uniqueness, and Continuation of Solutions
1.3 Lyapunov Analysis of Stability
1.4 Stability Analysis of Discrete Time Dynamical Systems
1.5 Summary
Bibliography

2 Optimal Control
2.1 Problem Formulation
2.2 Dynamic Programming
2.2.1 Principle of Optimality
2.2.2 Hamilton–Jacobi–Bellman Equation
2.2.3 A Sufficient Condition for Optimality
2.2.4 Infinite-Horizon Problems
2.3 Linear Quadratic Regulator
2.3.1 Differential Riccati Equation
2.3.2 Algebraic Riccati Equation
2.3.3 Convergence of Solutions to the Differential Riccati Equation
2.3.4 Forward Propagation of the Differential Riccati Equation for Linear Quadratic Regulator
2.4 Summary
Bibliography

3 Reinforcement Learning
3.1 Control-Affine Systems with Quadratic Costs
3.2 Exact Policy Iteration
3.2.1 Linear Quadratic Regulator
3.3 Policy Iteration with Unknown Dynamics and Function Approximations
3.3.1 Linear Quadratic Regulator with Unknown Dynamics
3.4 Summary
Bibliography

4 Learning of Dynamic Models
4.1 Introduction
4.1.1 Autonomous Systems
4.1.2 Control Systems
4.2 Model Selection
4.2.1 Gray-Box vs. Black-Box
4.2.2 Parametric vs. Nonparametric
4.3 Parametric Model
4.3.1 Model in Terms of Bases
4.3.2 Data Collection
4.3.3 Learning of Control Systems
4.4 Parametric Learning Algorithms
4.4.1 Least Squares
4.4.2 Recursive Least Squares
4.4.3 Gradient Descent
4.4.4 Sparse Regression
4.5 Persistence of Excitation
4.6 Python Toolbox
4.6.1 Configurations
4.6.2 Model Update
4.6.3 Model Validation
4.7 Comparison Results
4.7.1 Convergence of Parameters
4.7.2 Error Analysis
4.7.3 Runtime Results
4.8 Summary
Bibliography

5 Structured Online Learning-Based Control of Continuous-Time Nonlinear Systems
5.1 Introduction
5.2 A Structured Approximate Optimal Control Framework
5.3 Local Stability and Optimality Analysis
5.3.1 Linear Quadratic Regulator
5.3.2 SOL Control
5.4 SOL Algorithm
5.4.1 ODE Solver and Control Update
5.4.2 Identified Model Update
5.4.3 Database Update
5.4.4 Limitations and Implementation Considerations
5.4.5 Asymptotic Convergence with Approximate Dynamics
5.5 Simulation Results
5.5.1 Systems Identifiable in Terms of a Given Set of Bases
5.5.2 Systems to Be Approximated by a Given Set of Bases
5.5.3 Comparison Results
5.6 Summary
Bibliography

6 A Structured Online Learning Approach to Nonlinear Tracking with Unknown Dynamics
6.1 Introduction
6.2 A Structured Online Learning for Tracking Control
6.2.1 Stability and Optimality in the Linear Case
6.3 Learning-based Tracking Control Using SOL
6.4 Simulation Results
6.4.1 Tracking Control of the Pendulum
6.4.2 Synchronization of Chaotic Lorenz System
6.5 Summary
Bibliography

7 Piecewise Learning and Control with Stability Guarantees
7.1 Introduction
7.2 Problem Formulation
7.3 The Piecewise Learning and Control Framework
7.3.1 System Identification
7.3.2 Database
7.3.3 Feedback Control
7.4 Analysis of Uncertainty Bounds
7.4.1 Quadratic Programs for Bounding Errors
7.5 Stability Verification for Piecewise-Affine Learning and Control
7.5.1 Piecewise Affine Models
7.5.2 MIQP-based Stability Verification of PWA Systems
7.5.3 Convergence of ACCPM
7.6 Numerical Results
7.6.1 Pendulum System
7.6.2 Dynamic Vehicle System with Skidding
7.6.3 Comparison of Runtime Results
7.7 Summary
Bibliography

8 An Application to Solar Photovoltaic Systems
8.1 Introduction
8.2 Problem Statement
8.2.1 PV Array Model
8.2.2 DC-DC Boost Converter
8.3 Optimal Control of PV Array
8.3.1 Maximum Power Point Tracking Control
8.3.2 Reference Voltage Tracking Control
8.3.3 Piecewise Learning Control
8.4 Application Considerations
8.4.1 Partial Derivative Approximation Procedure
8.4.2 Partial Shading Effect
8.5 Simulation Results
8.5.1 Model and Control Verification
8.5.2 Comparative Results
8.5.3 Model-Free Approach Results
8.5.4 Piecewise Learning Results
8.5.5 Partial Shading Results
8.6 Summary
Bibliography

9 An Application to Low-level Control of Quadrotors
9.1 Introduction
9.2 Quadrotor Model
9.3 Structured Online Learning with RLS Identifier on Quadrotor
9.3.1 Learning Procedure
9.3.2 Asymptotic Convergence with Uncertain Dynamics
9.3.3 Computational Properties
9.4 Numerical Results
9.5 Summary
Bibliography

10 Python Toolbox
10.1 Overview
10.2 User Inputs
10.2.1 Process
10.2.2 Objective
10.3 SOL
10.3.1 Model Update
10.3.2 Database
10.3.3 Library
10.3.4 Control
10.4 Display and Outputs
10.4.1 Graphs and Printouts
10.4.2 3D Simulation
10.5 Summary
Bibliography

A Appendix
A.1 Supplementary Analysis of Remark 5.4
A.2 Supplementary Analysis of Remark 5.5

Index
About the Authors

Milad Farsi received a B.S. degree in Electrical Engineering (Electronics) from the University of Tabriz in 2010. He obtained an M.S. degree, also in Electrical Engineering (Control Systems), from Sahand University of Technology in 2013. He gained industrial experience as a Control System Engineer between 2012 and 2016. He received a Ph.D. degree in Applied Mathematics from the University of Waterloo, Canada, in 2022 and is currently a Postdoctoral Fellow at the same institution. His research interests include control systems, reinforcement learning, and their applications in robotics and power electronics.

Jun Liu received a B.S. degree in Applied Mathematics from Shanghai Jiao Tong University in 2002, an M.S. degree in Mathematics from Peking University in 2005, and a Ph.D. degree in Applied Mathematics from the University of Waterloo, Canada, in 2010. He is currently an Associate Professor of Applied Mathematics and a Canada Research Chair in Hybrid Systems and Control at the University of Waterloo, where he directs the Hybrid Systems Laboratory. From 2012 to 2015, he was a Lecturer in Control and Systems Engineering at the University of Sheffield. During 2011 and 2012, he was a Postdoctoral Scholar in Control and Dynamical Systems at the California Institute of Technology. His main research interests are in the theory and applications of hybrid systems and control, including rigorous computational methods for control design with applications in cyber-physical systems and robotics.
Preface

The subject of Reinforcement Learning (RL) is popularly associated with the psychology of animal learning through a trial-and-error mechanism. The underlying mathematical principle of RL techniques, however, is undeniably the theory of optimal control, as exemplified by landmark results in the late 1950s on dynamic programming by Bellman, the maximum principle by Pontryagin, and the Linear Quadratic Regulator (LQR) by Kalman. Optimal control itself has its roots in the much older subject of calculus of variations, which dates back to the late 1600s. Pontryagin's maximum principle and the Hamilton–Jacobi–Bellman (HJB) equation are the two main pillars of optimal control, the latter of which provides feedback control strategies through an optimal value function, whereas the former characterizes open-loop control signals.

Reinforcement learning was developed by Barto and Sutton in the 1980s, inspired by animal learning and behavioral psychology. The subject has experienced a resurgence of interest in both academia and industry over the past decade, amid the new explosive wave of AI and machine learning research. A notable recent success of RL was in tackling the otherwise seemingly intractable game of Go and defeating the world champion in 2016.

Arguably, the problems originally solved by RL techniques are mostly discrete in nature: for example, navigating mazes and playing video games, where both the states and actions are discrete (finite), or simple control tasks such as pole balancing with impulsive forces, where the actions (controls) are chosen to be discrete. More recently, researchers have started to investigate RL methods for problems with both continuous state and action spaces. On the other hand, classical optimal control problems by definition have continuous state and control variables. It seems natural to simply formulate optimal control problems in a more general way and develop RL techniques to solve them. Nonetheless, there are two main challenges in solving such optimal control problems from a computational perspective. First, most techniques require exact or at least approximate model information. Second, the computation of optimal value functions and feedback controls often suffers from the curse of dimensionality. As a result, such methods are often too slow to be applied in an online fashion.

The book was motivated by this very challenge of developing computationally efficient methods for online learning of feedback controllers for continuous control problems. A main part of this book was based on the PhD thesis of the first author, which presented a Structured Online Learning (SOL) framework for computing feedback controllers by forward integration of a state-dependent differential Riccati equation along state trajectories. In the special case of Linear Time-Invariant (LTI) systems, this reduces to solving the well-known LQR problem without prior knowledge of the model. The first part of the book (Chapters 1–3) provides some background materials, including Lyapunov stability analysis, optimal control, and RL for continuous control problems. The remaining part (Chapters 4–9) discusses the SOL framework in detail, covering both regulation and tracking problems, their further extensions, and various case studies.

The first author would like to convey his heartfelt thanks to those who encouraged and supported him during his research. The second author is grateful to the mentors, students, colleagues, and collaborators who have supported him throughout his career. We gratefully acknowledge financial support for the research through the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program, and the Ontario Early Researcher Award Program.

Waterloo, Ontario, Canada
April 2022

Milad Farsi and Jun Liu
Acronyms

ACCPM analytic center cutting-plane method
ARE algebraic Riccati equation
DNN deep neural network
DP dynamic programming
DRE differential Riccati equation
FPRE forward-propagating Riccati equation
GD gradient descent
GUAS globally uniformly asymptotically stable
GUES globally uniformly exponentially stable
HJB Hamilton–Jacobi–Bellman
KDE kernel density estimation
LMS least mean squares
LQR linear quadratic regulator
LQT linear quadratic tracking
LS least squares
LTI linear time-invariant
MBRL model-based reinforcement learning
MDP Markov decision process
MIQP mixed-integer quadratic program
MPC model predictive control
MPP maximum power point
MPPT maximum power point tracking
NN neural network
ODE ordinary differential equation
PDE partial differential equation
PE persistence of excitation
PI policy iteration
PV photovoltaic
PWA piecewise affine
PWM pulse-width modulation
RL reinforcement learning
RLS recursive least squares
RMSE root mean square error
ROA region of attraction
SDRE state-dependent Riccati equation
SINDy sparse identification of nonlinear dynamics
SMC sliding mode control
SOL structured online learning
SOS sum of squares
TD temporal difference
UAS uniformly asymptotically stable
UES uniformly exponentially stable
VI value iteration
I.1 Background and Motivation

I.1.1 Lack of an Efficient General Nonlinear Optimal Control Technique
Optimal control theory plays an important role in designing effective control systems. For linear systems, a class of optimal control problems has been solved successfully under the framework of the Linear Quadratic Regulator (LQR). LQR problems are concerned with minimizing a quadratic cost for linear systems in terms of the control input and state; solving them allows us to regulate the state and the control input of the system. In control applications, this provides an opportunity to shape the behavior of the system by adjusting the weighting coefficients used in the cost functional. However, when it comes to nonlinear dynamical systems, there is no systematic method for efficiently obtaining an optimal feedback control for general nonlinear systems. Thus, many of the techniques available in the literature on linear systems do not apply in general.
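For reference, in its standard continuous-time form, with positive (semi)definite weighting matrices Q and R chosen by the designer, the LQR problem reads

\[
\min_{u(\cdot)} \int_0^\infty \left( x^\top Q x + u^\top R u \right) \mathrm{d}t
\quad \text{subject to} \quad \dot{x} = A x + B u,
\]

and its solution is a linear state feedback \(u = -Kx\), which is what makes the closed-loop behavior tunable through the weights Q and R.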
Despite the complexity of nonlinear dynamical systems, they have attracted much attention from researchers in recent years. This is mostly because of their practical benefits in a wide variety of engineering applications, including power electronics, flight control, and robotics, among many others. For the control of a general nonlinear dynamical system, optimal control involves finding a control input that minimizes a cost functional depending on the controlled state trajectory and the control input. While such a problem formulation can cover a wide range of applications, how to efficiently solve such problems remains a topic of active research.
I.1.2 Importance of an Optimal Feedback Control
In general, there exist two well-known approaches to solving such optimal control problems: the maximum (or minimum) principles [Pontryagin, 1987] and the Dynamic Programming (DP) method [Bellman and Dreyfus, 1962]. To solve an optimization problem that involves dynamics, maximum principles require us to solve a two-point boundary value problem, where the solution is not in a feedback form.
Plenty of numerical techniques have been presented in the literature to solve the optimal control problem. Such approaches generally rely on knowledge of the exact model of the system. In the case where such a model exists, the optimal control input is obtained in open-loop form as a time-dependent signal. Consequently, implementing these approaches in real-world problems often involves many complications that are well known to the control community. This is because model mismatch, noise, and disturbances greatly affect the online solution, causing it to diverge from the preplanned offline solution. Therefore, obtaining a closed-loop solution of the optimal control problem is often preferred in such applications.
The DP approach analytically results in a feedback control for linear systems with a quadratic cost. Moreover, employing the Hamilton–Jacobi–Bellman (HJB) equation with a value function, one might manage to derive an optimal feedback control rule for some real-world applications, provided that the value function can be updated in an efficient manner. This motivates us to consider conditions leading to an optimal feedback control rule that can be efficiently implemented in real-world problems.
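For reference, for dynamics \(\dot{x} = f(x, u)\) and running cost \(l(x, u)\), the infinite-horizon HJB equation determines the optimal value function V and, through it, a feedback law:

\[
0 = \min_{u}\left[\, l(x, u) + \nabla V(x)^\top f(x, u) \,\right],
\qquad
u^*(x) = \arg\min_{u}\left[\, l(x, u) + \nabla V(x)^\top f(x, u) \,\right].
\]

The difficulty alluded to above is that V rarely admits a closed form and must be computed, or approximated, efficiently.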
I.1.3 Limits of Optimal Feedback Control Techniques
Consider an optimal control problem over an infinite horizon involving a nonquadratic performance measure. Using the idea of inverse optimal control, the cost functional can then be evaluated in closed form as long as the running cost depends in a suitable way on an underlying Lyapunov function by which the asymptotic stability of the nonlinear closed-loop system is guaranteed. It then follows that the Lyapunov function is indeed the solution of the steady-state HJB equation. Although such a formulation allows one to obtain an optimal feedback rule analytically, choosing the proper performance measure may not be trivial. Moreover, from a practical point of view, the nonlinearity in the performance measure might cause unpredictable behavior.
A well-studied method for solving an optimal control problem online is to employ a value function for a given policy. For any state, the value function gives a measure of how good the state is by accumulating the cost incurred from that state onward while the policy is applied. If such a value function can be obtained, and the system model is known, the optimal policy is the one that takes the system in the direction along which the value decreases the most in the state space. Such Reinforcement Learning (RL) techniques, known as value-based methods and including the Value Iteration (VI) and Policy Iteration (PI) algorithms, have been shown to be effective for finite state and control spaces. However, the computations cannot efficiently scale with the size of the state and control spaces.
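Concretely, for a policy \(u = \pi(x)\), the value function referred to here accumulates the running cost along the closed-loop trajectory from a given initial state:

\[
V^{\pi}(x_0) = \int_0^\infty l\big(x(t), \pi(x(t))\big)\,\mathrm{d}t,
\qquad \dot{x} = f\big(x, \pi(x)\big),\quad x(0) = x_0.
\]

PI alternates between evaluating \(V^{\pi}\) and improving \(\pi\) greedily with respect to it, while VI updates the value estimate directly.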
I.1.4 Complexity of Approximate DP Algorithms
One way of facilitating the computation of the value updates is to employ an approximation scheme. This is done by parameterizing the value function and adjusting the parameters in the training process. The optimal policy given by the value function is then also parameterized and approximated accordingly. The complexity of any value update depends directly on the number of parameters employed, and one may try to limit the number of parameters at the expense of optimality. We are therefore motivated to obtain a more efficient update rule for the value parameters, rather than limiting their number. We achieve this by reformulating the problem with a quadratically parameterized value function.
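For instance, with a vector of basis functions \(\Phi(x)\), a quadratically parameterized value function takes the form

\[
V(x) = \Phi(x)^\top P\,\Phi(x),
\]

where the symmetric matrix P collects the parameters. This is the structure that leads to the Riccati-type update rules for P developed later in the book.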
Moreover, the classical VI algorithm does not explicitly use the system model for evaluating the policy. This benefits applications in that full knowledge of the system dynamics is no longer required. However, online training with VI alone may take much longer to converge, since the model participates only implicitly through the future state. Therefore, the learning process can potentially be accelerated by introducing the system model. Furthermore, this creates an opportunity for running a separate identifier unit, where the model obtained can be simulated offline to complete the training or can be used for learning optimal policies for different objectives.
It can be shown that the VI algorithm for linear systems results in a Lyapunov recursion in the policy evaluation step. Such a Lyapunov equation in terms of the system matrices can be solved efficiently. For the general nonlinear case, however, no equivalent formulation is known that admits an efficient solution. Hence, we are motivated to investigate the possibility of acquiring an efficient update rule for nonlinear systems.
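To sketch that observation, for a discrete-time linear system \(x^+ = Ax + Bu\) with quadratic cost and the current linear policy \(u = -Kx\), the evaluation step reduces to the Lyapunov recursion

\[
P_{k+1} = Q + K^\top R K + (A - BK)^\top P_k (A - BK),
\]

which involves only the system matrices and is cheap to iterate; no comparably simple recursion is available for general nonlinear dynamics.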
I.1.5 Importance of Learning-based Tracking Approaches
One of the most common problems in the control of dynamical systems is to track a desired reference trajectory, a task found in a variety of real-world applications. However, designing an efficient tracking controller using conventional methods often necessitates a thorough understanding of the model, as well as computations and considerations specific to each application. RL approaches, on the other hand, offer a more flexible framework that requires less information about the system dynamics. While this may create additional problems, such as safety or computational limits, there are already effective outcomes from the use of such approaches in real-world situations. Similar to regulation problems, tracking control applications can benefit from Model-based Reinforcement Learning (MBRL), which can handle the parameter updates more efficiently.
I.1.6 Opportunities for Obtaining a Real-time Control
In the approximate optimal control technique, employing a limited number of parameters can only yield a local approximation of the model and the value function. If an approximation over a larger domain is intended, a considerably higher number of parameters may be needed. As a result, the complexity of the identification and the controller might be too high for online implementation in real-world applications. This motivates us to circumvent this constraint by instead considering a set of simple local learners in a piecewise approach.
As mentioned, there already exist interesting real-world applications of MBRL. Motivated by this, in this monograph, we aim to introduce automated ways of solving optimal control problems that can replace conventional controllers. Hence, detailed applications of the proposed approaches are included, which are demonstrated with numerical simulations.
I.1.7 Summary

The main motivation for this monograph can be summarized as follows:

● Optimal control is highly favored, while there is no general analytical technique applicable to all nonlinear systems.
● Feedback control techniques are known to be more robust and computationally efficient compared to numerical techniques, especially in the continuous space.
● The chance of obtaining a feedback control in closed form is low, and the known techniques are limited to some special classes of systems.
● Approximate DP provides a systematic way of obtaining an optimal feedback control, while the complexity grows significantly with the number of parameters.
● An efficient parameterization of the optimal value may provide an opportunity for more complex real-time applications in control regulation and tracking problems.
I.1.8 Outline of the Book

We summarize the main contents of the book as follows:

● Chapter 1 introduces Lyapunov stability analysis of nonlinear systems, which is used in subsequent chapters for analyzing the closed-loop performance of the feedback controllers.
● Chapter 2 formulates the optimal control problem and introduces the basic concepts of using the HJB equation to characterize optimal feedback controllers, where LQR is treated as a special case. A focus is on optimal feedback controllers for asymptotic stabilization tasks with an infinite-horizon performance criterion.
● Chapter 3 discusses PI as a prominent RL technique for solving continuous optimal control problems. PI algorithms for both linear and nonlinear systems, with and without any knowledge of the system model, are discussed. Proofs of convergence and stability analysis are provided in a self-contained manner.
● Chapter 4 presents different techniques for learning a dynamic model for continuous control in terms of a set of basis functions, including least squares, recursive least squares, gradient descent, and sparse identification techniques for parameter updates. Comparison results are shown using numerical examples.
● Chapter 5 introduces the Structured Online Learning (SOL) framework for control, including the algorithm and local analysis of stability and optimality. The focus is on regulation problems.
● Chapter 6 extends the SOL framework to tracking with unknown dynamics. Simulation results are given to show the effectiveness of the SOL approach. Numerical results on comparison with alternative RL approaches are also shown.
● Chapter 7 presents a piecewise learning framework as a further extension of the SOL approach, where we limit the learners to linear bases while allowing models to be learned in a piecewise fashion. Accordingly, closed-loop stability guarantees are provided with Lyapunov analysis facilitated by Mixed-Integer Quadratic Program (MIQP)-based verification.
● Chapters 8 and 9 present two case studies on Photovoltaic (PV) and quadrotor systems. Chapter 10 introduces the associated Python-based tool for SOL.

It should be noted that some of the contents of Chapters 5–9 have been previously published in Farsi and Liu [2020, 2021], Farsi et al. [2022], and Farsi and Liu [2022b, 2019], and they are included in this book with the permission of the cited publishers.
I.2 Literature Review

I.2.1 Reinforcement Learning
RL is a well-known class of machine learning methods concerned with learning to achieve a particular task through interactions with the environment. The task is often defined by some reward mechanism, and the intelligent agent has to take actions in different situations. The reward accumulated is then used as a measure to improve the agent's actions in the future, where the objective is to accumulate as much reward as possible over time. Therefore, it is expected that the agent's actions approach the optimal behavior in the long term. RL has achieved considerable success in simulation environments. However, the lack of explainability [Dulac-Arnold et al., 2019] and data efficiency [Duan et al., 2016] makes these methods less favorable as an online learning technique that can be directly employed in real-world problems, unless there exists a way to safely transfer the experience from simulation-based learning to the real world. The main challenges in the implementation of RL techniques are discussed in Dulac-Arnold et al. [2019]. Numerous studies have been done on this subject; see, e.g., Sutton and Barto [2018], Wiering and Van Otterlo [2012], Kaelbling et al. [1996], and Arulkumaran et al. [2017] for a list of related works. RL has found a variety of interesting applications in robotics [Kober et al., 2013], multiagent systems [Zhang et al., 2021; Da Silva and Costa, 2019; Hernandez-Leal et al., 2019], power systems [Zhang et al., 2019; Yang et al., 2020], autonomous driving [Kiran et al., 2021] and intelligent transportation [Haydari and Yilmaz, 2020], and healthcare [Yu et al., 2021], among others.
I.2.2 Model-based Reinforcement Learning

MBRL techniques, as opposed to model-free methods in learning, are known to be more data efficient. Direct model-free methods usually require enormous amounts of data and hours of training even for simple applications [Duan et al., 2016], while model-based techniques can show optimal behavior in a limited number of trials. This property, in addition to the flexibility in changing learning objectives and performing further safety analysis, makes them more suitable for real-world implementations, such as robotics [Polydoros and Nalpantidis, 2017]. In model-based approaches, having a deterministic or probabilistic description of the transition system saves much of the effort spent by direct methods in treating any point in the state-control space individually. Hence, the role of model-based techniques becomes even more significant when it comes to problems with continuous controls rather than discrete actions [Sutton, 1990; Atkeson and Santamaria, 1997; Powell, 2004].

In Moerland et al. [2020], the authors provide a survey of some recent MBRL methods that are formulated based on Markov Decision Processes (MDPs). In general, there exist two approaches to approximating a system: parametric and nonparametric. Parametric models are usually preferred over nonparametric ones, since the number of parameters is independent of the number of samples. Therefore, they can be implemented more efficiently on complex systems, where many samples are needed. In nonparametric approaches, on the other hand, the prediction for a given sample is obtained by comparing it with a set of samples already stored, which represent the model; hence, the complexity increases with the size of the dataset. In this book, because of this advantage of parametric models, we focus on parametric techniques.
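In the parametric setting adopted in this book, the model is typically expressed as a linear combination of a fixed set of p basis functions, for example

\[
\hat{f}(x) = \Theta\,\Phi(x) = \sum_{i=1}^{p} \theta_i\, \phi_i(x),
\]

so the number of parameters is fixed by the choice of bases rather than growing with the number of samples.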
I.2.3 Optimal Control
Let us specifically consider implementations of RL on control systems. Although RL techniques do not require the dynamical model to solve the problem, they are in fact intended to find a solution to the optimal control problem, which has been extensively investigated by the control community. The LQR problem has been solved satisfactorily for linear systems using Riccati equations [Kalman, 1960], which also ensures system stability for infinite-horizon problems. However, in the case of nonlinear systems, obtaining such a solution is not trivial and requires us to solve the HJB equation, either analytically or numerically, which is a challenging task, especially when we do not have knowledge of the system model.
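For reference, the LQR solution mentioned here: for \(\dot{x} = Ax + Bu\) with cost \(\int_0^\infty (x^\top Q x + u^\top R u)\,\mathrm{d}t\), the optimal feedback is \(u = -R^{-1}B^\top P x\), where P is the stabilizing solution of the Algebraic Riccati Equation

\[
A^\top P + P A - P B R^{-1} B^\top P + Q = 0.
\]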
Model Predictive Control (MPC) [Camacho and Alba, 2013; Garcia et al., 1989; Qin and Badgwell, 2003; Grüne and Pannek, 2017; Mayne and Michalska, 1988; Morari and Lee, 1999] has been frequently used as an optimal control technique, which is inherently model-based. Furthermore, it deals with the control problem only across a restricted prediction horizon. For this reason, and because the problem is not considered in closed-loop form, stability analysis is hard to establish. For the same reasons, the online computational complexity is considerably high compared to a feedback control rule that can be efficiently implemented.
The Forward-Propagating Riccati Equation (FPRE) [Weiss et al., 2012; Prach et al., 2015] is one of the techniques presented for solving the LQR problem. Normally, the Differential Riccati Equation (DRE) is solved backward from a final condition. In an analogous technique, it can instead be solved forward in time from some initial condition. A comparison between these two schemes is given in Prach et al. [2015]. Employing forward-integration methods makes the technique suitable for solving the problem for time-varying systems [Weiss et al., 2012; Chen and Kao, 1997] or in the RL setting [Lewis et al., 2012], since the future dynamics are not needed, whereas the backward technique requires knowledge of the future dynamics from the final condition. FPRE has been shown to be an efficient technique for finding a suboptimal solution for linear systems; for nonlinear systems, the assumption is that the system is linearized along the system's trajectories.
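To make the contrast explicit: the finite-horizon LQR solution propagates the DRE backward in time from a terminal condition P(T),

\[
-\dot{P} = A^\top P + P A - P B R^{-1} B^\top P + Q,
\]

whereas FPRE integrates an equation with the same right-hand side forward in time from an initial condition P(0), so that no knowledge of the future dynamics is required.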
The State-Dependent Riccati Equation (SDRE) [Çimen, 2008; Erdem and Alleyne, 2004; Cloutier, 1997] is another technique that can be found in the literature for solving the optimal control problem for nonlinear systems. This technique relies on the fact that any nonlinear system can be written in the form of a linear system with state-dependent matrices. However, this conversion is not unique; hence, a suboptimal solution is expected. Similar to MPC, it does not yield a feedback control rule, since the control at each state is computed by solving a Riccati equation that depends on the system's trajectory.
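The starting point of this technique is the (nonunique) state-dependent factorization

\[
\dot{x} = f(x) + g(x)\,u = A(x)\,x + B(x)\,u,
\]

after which a Riccati equation with the matrices frozen at the current state is solved along the trajectory to compute the control.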
I.2.4 Dynamic Programming
Other model-based approaches can be found in the literature that are mainly categorized under RL in two groups: value function and policy search methods. In value function-based methods, also known as approximate/adaptive DP techniques [Wang et al., 2009; Lewis and Vrabie, 2009; Balakrishnan et al., 2008], a value function is used to construct the policy. Policy search methods, on the other hand, directly improve the policy to achieve optimality. Adaptive DP has found different applications [Prokhorov, 2008; Ferrari-Trecate et al., 2003; Prokhorov et al., 1995; Murray et al., 2002; Yu et al., 2014; Han and Balakrishnan, 2002; Lendaris et al., 2000; Liu and Balakrishnan, 2000] in automotive control, flight control, and power control, among others. A review of recent techniques can be found in Kalyanakrishnan and Stone [2009], Busoniu et al. [2017], Recht [2019], Polydoros and Nalpantidis [2017], and Kamalapurkar et al. [2018]. The Q-learning approach learns an action-dependent function using Temporal Difference (TD) to obtain the optimal policy. This is inherently a discrete approach. There are continuous extensions of this technique, such as Millán et al. [2002], Gaskett et al. [1999], Ryu et al. [2019], and Wei et al. [2018]. However, for an efficient implementation, the state and action spaces ought to be finite, which is highly restrictive for continuous problems.
Adaptive controllers [Åström and Wittenmark, 2013], as a well-known class of control techniques, may seem similar to RL in methodology, while there are substantial differences in the problem formulation and objectives. Adaptive techniques, as well as RL, learn to regulate unknown systems utilizing data collected in real time. In fact, an RL technique can be seen as an adaptive technique that converges to the optimal control [Lewis and Vrabie, 2009]. However, as opposed to RL and optimal controllers, adaptive controllers are not normally intended to be optimal with respect to a user-specified cost function. Hence, we will not draw direct comparisons with such methods.
Value methods in RL normally require solving the well-known HJB equation. However, common techniques for solving such equations suffer from the curse of dimensionality. Hence, in approximate DP techniques, a parametric or nonparametric model is used to approximate the solution. In Lewis and Vrabie [2009], some related approaches are reviewed that fundamentally follow the actor-critic structure [Barto et al., 1983], such as VI and PI algorithms.
In such approaches, the Bellman error, which is obtained from exploration of the state space, is used to improve the parameters estimated in a gradient-descent or least-squares loop that requires the Persistence of Excitation (PE) condition. Since the Bellman error obtained is only valid along the trajectories of the system, sufficient exploration of the state space is required to efficiently estimate the parameters. In Kamalapurkar et al. [2018], the authors have reviewed different strategies employed to increase the data efficiency of exploration. In Vamvoudakis et al. [2012] and Modares et al. [2014], a probing signal is added to the control to enhance the exploring properties of the policy. In another approach [Modares et al., 2014], the recorded data of explorations are used as a replay of experience to increase the data efficiency. Accordingly, the model obtained from identification is used to acquire more experience by doing simulation in an offline routine, which decreases the need for visiting every point in the state space.
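As a minimal illustration of such a loop, the following sketch evaluates a fixed policy by gradient descent on the continuous-time Bellman error for a scalar system, with a value function that is linear in a chosen set of bases. All of the ingredients here (dynamics, policy, bases, step size) are hypothetical placeholders chosen for illustration, not the toolbox API of this book.

import numpy as np

# Scalar system x_dot = f(x, u), quadratic running cost, and a fixed policy.
f = lambda x, u: -x + u
l = lambda x, u: x**2 + u**2
pi = lambda x: -0.5 * x

# Value approximation V(x) = w @ phi(x), with gradients of the bases.
phi = lambda x: np.array([x**2, x**4])
dphi = lambda x: np.array([2 * x, 4 * x**3])

w = np.zeros(2)      # value parameters
alpha = 0.5          # normalized step size

rng = np.random.default_rng(0)
for _ in range(20000):
    x = rng.uniform(-2.0, 2.0)   # random states stand in for exploration
    u = pi(x)
    # Continuous-time Bellman error along the policy:
    # delta = l(x, u) + dV/dx * f(x, u), which is zero at the true value.
    delta = l(x, u) + (w @ dphi(x)) * f(x, u)
    # Normalized gradient step on 0.5 * delta**2 with respect to w.
    g = dphi(x) * f(x, u)
    w -= alpha * delta * g / (1.0 + g @ g)

print(w)  # approaches [5/12, 0]: here the closed loop has V(x) = (5/12) x**2

The random state samples play the role of exploration: if the sampled data are not sufficiently rich, the regressors fail the PE condition and the parameters need not converge.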
As an alternative method, considering a nonlinear control-affine system with a known input coupling function, the work Kamalapurkar et al. [2016b] used a parametric model to approximate the value function. The authors then employed a least-squares minimization technique to adjust the parameters according to the Bellman error, which can be calculated at any arbitrary point of the state space using the identified internal dynamics of the system and approximated state derivatives, under a PE-like rank condition. In Kamalapurkar et al. [2016a], the authors proposed an improved technique, which approximates the value function only in a small neighborhood of the current state. It has been shown that the local approximation can be done more efficiently, since considerably fewer bases can be used.
In the works by Jiang and Jiang [2012, 2014, 2017], the authors proposed PI-based algorithms that do not require any prior knowledge of the system model. A similar PE-like rank condition was used to ensure sufficient exploration for successful learning of the value functions and controllers. It is shown that these algorithms can achieve semiglobal stabilization and convergence to optimal values and controllers. One of the main limitations of PI-based algorithms is that an initial stabilizing controller has to be provided. While convergent PI algorithms have recently been proved for discrete-time systems in more general settings [Bertsekas, 2017], their extension to continuous-time systems involves substantial technical difficulties, as pointed out by Bertsekas [2017].
I.2.5 Piecewise Learning
There exist different techniques to efficiently fit a piecewise model to data; see, e.g., Toriello and Vielma [2012], Breschi et al. [2016], Ferrari-Trecate et al. [2003], Amaldi et al. [2016], Rebennack and Krasko [2020], and Du et al. [2021]. In Ferrari-Trecate et al. [2003], a technique for the identification of discrete-time hybrid systems by the piecewise affine model is presented. The algorithm combines clustering, linear identification, and pattern recognition approaches to identify the affine subsystems together with the partitions for which they apply. In fact, the problem of globally fitting a piecewise affine model is considered computationally expensive to solve. In Lauer [2015], it is discussed that global optimality can be reached with a complexity that is polynomial in the number of data but exponential in the data dimension. In this regard, the work by Breschi et al. [2016] presents an efficient two-step technique: first, recursive clustering of the regressor vectors and estimation of the model parameters; second, computation of a polyhedral partition. A review of some of the techniques can be found in Gambella et al. [2021] and Garulli et al. [2012].
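Throughout this discussion, a piecewise affine model over a polyhedral partition \(\{\mathcal{P}_i\}_{i=1}^{M}\) of the domain has the form

\[
\hat{f}(x) = A_i x + c_i \quad \text{for } x \in \mathcal{P}_i,
\]

so identification must estimate both the local parameters \((A_i, c_i)\) and the partition itself, which is the source of the computational difficulty noted above.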
The flexibility of Piecewise Affine (PWA) systems makes them suitable for different approaches in control. Hence, the control problem of piecewise systems is extensively studied in the literature; see, e.g., Marcucci and Tedrake [2019], Zou and Li [2007], Rodrigues and Boyd [2005], Baotic [2005], Strijbosch et al. [2020], Christophersen et al. [2005], and Rodrigues and How [2003]. Moreover, various applications can be found for PWA systems, including robotics [Andrikopoulos et al., 2013; Marcucci et al., 2017], automotive control [Borrelli et al., 2006; Sun et al., 2019], and power electronics [Geyer et al., 2008; Vlad et al., 2012]. In Zou and Li [2007], the robust MPC strategy is extended to PWA systems with polytopic uncertainty, where multiple PWA quadratic Lyapunov functions are employed for different vertices of the uncertainty polytope in different partitions. In another work by Marcucci and Tedrake [2019], hybrid MPC is formulated as a mixed-integer program to solve the optimal control problem for PWA systems. However, these techniques are only available in an open-loop form, which decreases their applicability for real-time control.
On the other hand, Deep Neural Networks (DNNs) offer an efficient technique for control in closed loop. However, one drawback of DNN-based control is the difficulty in stability analysis. This becomes even more challenging when PWA systems are considered. The work by Chen et al. [2020] suggested a sample-efficient technique for synthesizing a Lyapunov function for the PWA system controlled through a DNN in closed loop. In this approach, the Analytic Center Cutting-Plane Method (ACCPM) [Goffin and Vial, 1993; Nesterov, 1995; Boyd and Vandenberghe, 2004] is first used to search for a Lyapunov function. Then, this Lyapunov function candidate is verified on the closed-loop system using an MIQP. This approach relies on knowledge of the exact model of the system; hence, it cannot be directly implemented on an identified PWA system with uncertainty.
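Schematically, the verification step poses the Lyapunov decrease condition as an optimization over the closed-loop PWA map \(f_{\mathrm{cl}}\): a candidate V is certified if

\[
\max_{x \ne 0}\; V\big(f_{\mathrm{cl}}(x)\big) - V(x) < 0,
\]

where encoding the polyhedral partition with binary variables turns this check into an MIQP; if instead the optimum is nonnegative, the maximizing state serves as a counterexample that ACCPM uses to cut down the set of Lyapunov candidates.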
I.2.6 Tracking Control
For the learning-based tracking problem, several techniques can be found in the literature, reviewed in Modares and Lewis [2014], Modares et al. [2015], Zhu et al. [2016], Yang et al. [2016], and Luo et al. [2016], in addition to some extensions presented for the techniques discussed earlier. Modares and Lewis [2014] developed an integral RL technique for linear systems based on the PI algorithm, starting with an admissible initial controller. It has been shown that the optimal tracking controller converges to the Linear Quadratic Tracking (LQT) controller, with a partially unknown system. In Modares et al. [2015], an off-policy method is employed with three neural networks in an actor-critic-disturbance configuration to learn an H∞-tracking controller for unknown nonlinear systems. Zhu et al. [2016] constructed an augmented system using the tracking error and the reference; neural networks were employed, in an actor-critic structure, to approximate the value function and learn an optimal policy. In another neural network-based approach [Yang et al., 2016], a single network was used to approximate the value function, where classes of uncertain dynamics were assumed. In addition to the above approaches, other similar ones exist in the literature. However, applications of RL in tracking control are not limited to model-based techniques. For instance, Luo et al. [2016] suggest a critic-only Q-learning approach for tracking problems, which does not require solving the HJB equation.
I.2.7 Applications
As mentioned, there exist different applications of MBRL, as well as of optimal control, to real-world problems [Prokhorov, 2008; Ferrari-Trecate et al., 2003; Prokhorov et al., 1995; Murray et al., 2002; Yu et al., 2014; Han and Balakrishnan, 2002; Lendaris et al., 2000; Liu and Balakrishnan, 2000]. Accordingly, we will later provide a detailed literature review for each of the applications, including the quadrotor and solar PV systems, in Chapters 9 and 8, respectively.
Bibliography

Edoardo Amaldi, Stefano Coniglio, and Leonardo Taccari. Discrete optimization methods to fit piecewise affine models to data points. Computers & Operations Research, 75:214–230, 2016.

George Andrikopoulos, George Nikolakopoulos, Ioannis Arvanitakis, and Stamatis Manesis. Piecewise affine modeling and constrained optimal control for a pneumatic artificial muscle. IEEE Transactions on Industrial Electronics, 61(2):904–916, 2013.

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.

Karl J. Åström and Björn Wittenmark. Adaptive Control. Courier Corporation, 2013.

Christopher G. Atkeson and Juan Carlos Santamaria. A comparison of direct and model-based reinforcement learning. In Proceedings of the International Conference on Robotics and Automation, volume 4, pages 3557–3564. IEEE, 1997.

S. N. Balakrishnan, Jie Ding, and Frank L. Lewis. Issues on stability of ADP feedback controllers for dynamical systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(4):913–917, 2008.